Basic Usage

This guide covers fundamental operations with Red Arrow, from creating tables to performing data transformations.

Getting Started

First, require the Arrow library:

require 'arrow'

Creating Tables

From Ruby Hash

The simplest way to create a table is from a Ruby hash. Data types are automatically detected:

table = Arrow::Table.new(
  'name' => ['Alice', 'Bob', 'Charlie'],
  'age' => [25, 30, 35],
  'salary' => [50000.0, 60000.0, 75000.0]
)

puts table.to_s
# Output:
#         name	  age	    salary
#      (string)	(int64)	  (double)
# 0       Alice	     25	  50000.0
# 1         Bob	     30	  60000.0
# 2     Charlie	     35	  75000.0

From Arrays

Create tables using Arrow array types:

count_array = Arrow::UInt32Array.new([0, 2, nil, 4])
visible_array = Arrow::BooleanArray.new([true, nil, nil, false])

table = Arrow::Table.new(
  'count' => count_array,
  'visible' => visible_array
)

With Explicit Schema

Define schema explicitly for precise control:

# Define fields
count_field = Arrow::Field.new('count', :uint32)
visible_field = Arrow::Field.new('visible', :boolean)
schema = Arrow::Schema.new([count_field, visible_field])

# Create arrays
count_array = Arrow::UInt32Array.new([0, 2, nil, 4])
visible_array = Arrow::BooleanArray.new([true, nil, nil, false])

# Create table with schema
table = Arrow::Table.new(schema, [count_array, visible_array])

From Raw Records

Create tables from arrays of records:

schema = {
  count: :uint32,
  visible: :boolean
}

raw_records = [
  [0, true],
  [2, nil],
  [nil, nil],
  [4, false]
]

table = Arrow::Table.new(schema, raw_records)

Loading and Saving Data

Loading from Files

# Load Arrow IPC file
table = Arrow::Table.load('data.arrow')

# Load CSV
table = Arrow::Table.load('data.csv', format: :csv)

# Load Parquet (requires red-parquet)
require 'parquet'
table = Arrow::Table.load('data.parquet', format: :parquet)

Loading from S3

With red-arrow-dataset, load directly from S3:

require 'arrow-dataset'

# Public bucket
table = Arrow::Table.load(URI('s3://bucket/data.csv'))

# Private bucket with credentials
require 'cgi/util'
access_key = 'YOUR_ACCESS_KEY'
secret_key = 'YOUR_SECRET_KEY'
uri = URI("s3://#{CGI.escape(access_key)}:#{CGI.escape(secret_key)}@bucket/data.parquet")
table = Arrow::Table.load(uri)
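
Secret keys often contain characters such as '/' or '+' that have a special meaning inside a URI, which is why the example above escapes both credentials. The following sketch uses only Ruby's standard library and a made-up key pair (both values are hypothetical) to show the round trip. It does not contact S3.

```ruby
require 'cgi/util'
require 'uri'

# Hypothetical credentials; the secret deliberately contains '/' and '+'.
access_key = 'AKIAEXAMPLE'
secret_key = 'abc/def+ghi'

escaped_secret = CGI.escape(secret_key)
uri = URI("s3://#{CGI.escape(access_key)}:#{escaped_secret}@bucket/data.parquet")

puts escaped_secret              # abc%2Fdef%2Bghi
puts uri.host                    # bucket
puts uri.path                    # /data.parquet
puts CGI.unescape(uri.password)  # abc/def+ghi
```

Without escaping, the '/' in the secret would be read as the start of the path, so the bucket name and object key would be parsed incorrectly.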

Loading from HTTP

require 'net/http'

params = {
  query: "SELECT * FROM table LIMIT 10 FORMAT Arrow",
  user: "username",
  password: "password"
}
uri = URI('https://example.com/query')
uri.query = URI.encode_www_form(params)
resp = Net::HTTP.get(uri)

table = Arrow::Table.load(Arrow::Buffer.new(resp))
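
URI.encode_www_form percent-encodes each parameter, so the SQL text may contain spaces and other reserved characters. A small standard-library sketch with shortened, made-up parameters shows the resulting URL. No request is sent.

```ruby
require 'uri'

# Shortened, hypothetical parameters; no request is sent.
params = {
  query: 'SELECT 1 FORMAT Arrow',
  user: 'username'
}
uri = URI('https://example.com/query')
uri.query = URI.encode_www_form(params)

puts uri.to_s
# https://example.com/query?query=SELECT+1+FORMAT+Arrow&user=username
```

Net::HTTP.get returns the response body as a String regardless of the HTTP status, so it may be worth checking the status with Net::HTTP.get_response before passing the body to Arrow::Table.load.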

Saving Tables

# Save as Arrow IPC file
table.save('output.arrow')

# Save as CSV
table.save('output.csv', format: :csv)

# Save as Parquet
require 'parquet'
table.save('output.parquet', format: :parquet)

Accessing Data

Column Access

# Access column by name
age_column = table['age']
age_column = table[:age]

# Access column by index
first_column = table.columns[0]

# Get column names
table.column_names  # => ['name', 'age', 'salary']

# Get number of columns
table.n_columns  # => 3

Row Access

# Get number of rows (size and length are aliases of n_rows)
table.n_rows
table.size
table.length

# Access single row (returns Arrow::Record)
row = table.slice(0)
row['name']   # => 'Alice'
row['age']    # => 25

# Iterate over rows
table.each_record_batch do |record_batch|
  record_batch.each do |record|
    puts "#{record['name']}: #{record['age']}"
  end
end

Filtering Data

Using Slicer

Pass a block to slice to filter rows. The block receives a slicer, and comparisons on its columns build the filter condition:

table = Arrow::Table.new(
  'name' => ['Tom', 'Max', 'Kate'],
  'age' => [22, 23, 19]
)

# Simple condition
result = table.slice { |slicer| slicer['age'] > 19 }
# Returns rows where age > 19

# Range condition
result = table.slice { |slicer| slicer['age'].in?(19..22) }
# Returns rows where age is between 19 and 22

Combining Conditions

Use logical operators to combine filters:

# AND (&)
result = table.slice { |slicer|
  (slicer['age'] > 19) & (slicer['age'] < 23)
}

# OR (|)
result = table.slice { |slicer|
  (slicer['age'] < 20) | (slicer['age'] > 22)
}

# XOR (^)
result = table.slice { |slicer|
  (slicer['age'] < 21) ^ (slicer['name'] == 'Tom')
}
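
The parentheses around each comparison are required: in Ruby, &, | and ^ bind more tightly than > and <. As a mental model only (this is plain Ruby on an array of hashes, not how Red Arrow evaluates slicers), the three filters above select the following rows from the sample table:

```ruby
# Plain-Ruby model of the sample table; not Red Arrow API.
rows = [
  { 'name' => 'Tom',  'age' => 22 },
  { 'name' => 'Max',  'age' => 23 },
  { 'name' => 'Kate', 'age' => 19 }
]

and_names = rows.select { |r| (r['age'] > 19) & (r['age'] < 23) }.map { |r| r['name'] }
or_names  = rows.select { |r| (r['age'] < 20) | (r['age'] > 22) }.map { |r| r['name'] }
xor_names = rows.select { |r| (r['age'] < 21) ^ (r['name'] == 'Tom') }.map { |r| r['name'] }

p and_names  # => ["Tom"]
p or_names   # => ["Max", "Kate"]
p xor_names  # => ["Tom", "Kate"]
```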

Hash-based Filtering

# Filter by exact match
result = table.slice('name' => 'Tom')

# Filter by range
result = table.slice('age' => 20..25)

Array-based Filtering

# Boolean array
filter = [true, false, true]
result = table.slice(filter)

# Arrow BooleanArray
filter_array = Arrow::BooleanArray.new([true, false, true])
result = table.slice(filter_array)

Grouping and Aggregation

Perform group-by operations:

table = Arrow::Table.new(
  'name' => ['Tom', 'Max', 'Kate', 'Tom'],
  'amount' => [10, 2, 3, 5]
)

# Group and sum
result = table.group('name').sum('amount')
# Output:
#   name	amount
# 0 Kate	     3
# 1  Max	     2
# 2  Tom	    15

# Other aggregation functions
table.group('name').count('amount')
table.group('name').mean('amount')
table.group('name').min('amount')
table.group('name').max('amount')
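
Conceptually, group collects the rows that share a key, and the aggregation reduces each group to a single value. A plain-Ruby equivalent of the sum above, meant as an illustration of the semantics rather than of Red Arrow's implementation:

```ruby
# Plain-Ruby stand-ins for the two columns; not Red Arrow API.
names   = ['Tom', 'Max', 'Kate', 'Tom']
amounts = [10, 2, 3, 5]

totals = names.zip(amounts)
              .group_by { |name, _amount| name }
              .transform_values { |pairs| pairs.sum { |_name, amount| amount } }

totals.each { |name, total| puts "#{name}: #{total}" }
# Tom: 15
# Max: 2
# Kate: 3
```

The row order here follows first appearance in the input. The order of groups in an Arrow result may differ.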

Joining Tables

Join tables using common keys:

amounts = Arrow::Table.new(
  'name' => ['Tom', 'Max', 'Kate'],
  'amount' => [10, 2, 3]
)

levels = Arrow::Table.new(
  'name' => ['Max', 'Kate', 'Tom'],
  'level' => [1, 9, 5]
)

# Inner join on the shared 'name' column (the key column from both tables is kept)
result = amounts.join(levels, [:name])
# Output:
#   name	amount	name	level
# 0  Tom	    10	 Tom	    5
# 1  Max	     2	 Max	    1
# 2 Kate	     3	Kate	    9

Join Types

# Inner join (default)
table1.join(table2, [:key], type: :inner)

# Left outer join
table1.join(table2, [:key], type: :left_outer)

# Right outer join
table1.join(table2, [:key], type: :right_outer)

# Full outer join
table1.join(table2, [:key], type: :full_outer)
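
The join types differ in which keys survive. A plain-Ruby sketch with two hypothetical key lists (not Red Arrow API, and assuming each key appears once per table) shows the key set each type keeps:

```ruby
# Hypothetical key columns of two tables.
left_keys  = ['Tom', 'Max', 'Kate']
right_keys = ['Max', 'Kate', 'Ann']

kept = {
  inner:       left_keys & right_keys,  # keys present on both sides
  left_outer:  left_keys,               # every left key
  right_outer: right_keys,              # every right key
  full_outer:  left_keys | right_keys   # keys from either side
}

kept.each { |type, keys| puts "#{type}: #{keys.join(', ')}" }
# inner: Max, Kate
# left_outer: Tom, Max, Kate
# right_outer: Max, Kate, Ann
# full_outer: Tom, Max, Kate, Ann
```

In the outer joins, a row whose key has no match on the other side gets nulls in the other table's columns.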

Different Key Names

# Join when key columns have different names
table1.join(table2, {left: 'user_id', right: 'id'})

Transforming Data

Adding Columns

# Merge adds or replaces columns
new_column = Arrow::Int64Array.new([100, 200, 300])
result = table.merge('score' => new_column)

Removing Columns

# Remove by name
result = table.remove_column('age')

# Remove by index
result = table.remove_column(1)

Slicing by Range

# Slice rows 2 to 4 (inclusive)
result = table.slice(2..4)

# Slice rows 2 to 4 (exclusive end)
result = table.slice(2...4)

# Slice with offset and length
result = table.slice(2, 3)  # 3 rows starting at index 2
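
The three forms follow Ruby's usual range conventions. Assuming table.slice treats them the same way Array#[] does, as the comments above describe, a plain Array shows which row indices each form covers:

```ruby
# Stand-in for the row indices of a 10-row table; not Red Arrow API.
row_indices = (0..9).to_a

p row_indices[2..4]   # => [2, 3, 4]
p row_indices[2...4]  # => [2, 3]
p row_indices[2, 3]   # => [2, 3, 4]
```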

Working with Compute Functions

Access Arrow’s compute functions directly:

# Find a compute function
add = Arrow::Function.find('add')

# Execute function
result = add.execute([table['age'].data, table['age'].data])
ages_doubled = result.value

Common functions:

Arithmetic: add, subtract, multiply, divide
Comparison: equal, greater, less, greater_equal, less_equal
String: utf8_length, starts_with, ends_with
Statistical: sum, mean, min, max, stddev

Reading and Writing Streams

Writing Streams

# Define schema
fields = [
  Arrow::Field.new('uint8', :uint8),
  Arrow::Field.new('uint16', :uint16),
  Arrow::Field.new('int32', :int32)
]
schema = Arrow::Schema.new(fields)

# Write to stream
Arrow::FileOutputStream.open('/tmp/stream.arrow', false) do |output|
  Arrow::RecordBatchStreamWriter.open(output, schema) do |writer|
    # Create record batches
    columns = [
      Arrow::UInt8Array.new([1, 2, 4, 8]),
      Arrow::UInt16Array.new([1, 2, 4, 8]),
      Arrow::Int32Array.new([1, -2, 4, -8])
    ]
    record_batch = Arrow::RecordBatch.new(schema, 4, columns)
    writer.write_record_batch(record_batch)
  end
end

Reading Streams

Arrow::MemoryMappedInputStream.open('/tmp/stream.arrow') do |input|
  reader = Arrow::RecordBatchStreamReader.new(input)
  fields = reader.schema.fields
  
  reader.each_with_index do |record_batch, i|
    puts "Record batch #{i}:"
    fields.each do |field|
      field_name = field.name
      values = record_batch.collect { |record| record[field_name] }
      puts "  #{field_name}: #{values.inspect}"
    end
  end
end

Memory Management

Packing Tables

Optimize memory layout by packing chunked arrays:

# Pack consolidates chunked arrays into contiguous memory
packed_table = table.pack

Memory-Mapped Files

Use memory mapping for efficient file access:

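Arrow::MemoryMappedInputStream.open('/path/to/file.arrow') do |input|
  reader = Arrow::RecordBatchFileReader.new(input)
  # Build a table from the file's record batches. Use it inside the block:
  # the data is backed by the mapped file, which is closed when the block ends.
  table = Arrow::Table.new(reader.schema, reader.to_a)
end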

Type System

Red Arrow refers to Arrow data types by symbol. The supported types include:

Numeric Types

:int8, :int16, :int32, :int64
:uint8, :uint16, :uint32, :uint64
:float, :double

String and Binary

:string, :binary
:large_string, :large_binary

Temporal Types

:date32, :date64
:time32, :time64
:timestamp
:duration

Other Types

:boolean
:decimal128, :decimal256
:list, :large_list, :fixed_size_list
:struct, :map
:dictionary

Best Practices

Use explicit schemas for production code to ensure data consistency
Pack tables when memory is constrained or before serialization
Use memory-mapped I/O for large files
Leverage columnar operations instead of row-by-row processing
Batch operations when possible for better performance
Close resources explicitly or use blocks for automatic cleanup

Next Steps

Explore the Ruby source code for advanced usage
Check out the Apache Arrow documentation for deeper understanding
Join the Apache Arrow community for support
