Basic Usage

This guide covers fundamental operations with Red Arrow, from creating tables to performing data transformations.

Getting Started

First, require the Arrow library:
require 'arrow'

Creating Tables

From Ruby Hash

The simplest way to create a table is from a Ruby hash that maps column names to arrays of values. Each column's data type is inferred from its values:
table = Arrow::Table.new(
  'name' => ['Alice', 'Bob', 'Charlie'],
  'age' => [25, 30, 35],
  'salary' => [50000.0, 60000.0, 75000.0]
)

puts table.to_s
# Output:
#         name	  age	    salary
#      (string)	(int64)	  (double)
# 0       Alice	     25	  50000.0
# 1         Bob	     30	  60000.0
# 2     Charlie	     35	  75000.0

From Arrays

Create tables using Arrow array types:
count_array = Arrow::UInt32Array.new([0, 2, nil, 4])
visible_array = Arrow::BooleanArray.new([true, nil, nil, false])

table = Arrow::Table.new(
  'count' => count_array,
  'visible' => visible_array
)

With Explicit Schema

Define schema explicitly for precise control:
# Define fields
count_field = Arrow::Field.new('count', :uint32)
visible_field = Arrow::Field.new('visible', :boolean)
schema = Arrow::Schema.new([count_field, visible_field])

# Create arrays
count_array = Arrow::UInt32Array.new([0, 2, nil, 4])
visible_array = Arrow::BooleanArray.new([true, nil, nil, false])

# Create table with schema
table = Arrow::Table.new(schema, [count_array, visible_array])

From Raw Records

Create tables from arrays of records:
schema = {
  count: :uint32,
  visible: :boolean
}

raw_records = [
  [0, true],
  [2, nil],
  [nil, nil],
  [4, false]
]

table = Arrow::Table.new(schema, raw_records)
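
Raw records are row-oriented, while the earlier examples pass one array per column. As a rough plain-Ruby illustration (it does not call Red Arrow, and `records_to_columns` is a hypothetical helper), transposing the records above yields the same per-column values that were passed to `Arrow::UInt32Array` and `Arrow::BooleanArray` earlier:

```ruby
# Plain-Ruby sketch: pivot row-oriented records into named columns.
# `records_to_columns` is a hypothetical helper, not a Red Arrow method.
def records_to_columns(column_names, raw_records)
  # transpose turns [[a0, b0], [a1, b1], ...] into [[a0, a1, ...], [b0, b1, ...]]
  column_names.zip(raw_records.transpose).to_h
end

raw_records = [
  [0, true],
  [2, nil],
  [nil, nil],
  [4, false]
]

columns = records_to_columns([:count, :visible], raw_records)
p columns[:count]    # values of the first field, one per record
p columns[:visible]  # values of the second field, one per record
```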

Loading and Saving Data

Loading from Files

# Load Arrow IPC file
table = Arrow::Table.load('data.arrow')

# Load CSV
table = Arrow::Table.load('data.csv', format: :csv)

# Load Parquet (requires red-parquet)
require 'parquet'
table = Arrow::Table.load('data.parquet', format: :parquet)

Loading from S3

With red-arrow-dataset, load directly from S3:
require 'arrow-dataset'

# Public bucket
table = Arrow::Table.load(URI('s3://bucket/data.csv'))

# Private bucket with credentials
require 'cgi/util'
access_key = 'YOUR_ACCESS_KEY'
secret_key = 'YOUR_SECRET_KEY'
uri = URI("s3://#{CGI.escape(access_key)}:#{CGI.escape(secret_key)}@bucket/data.parquet")
table = Arrow::Table.load(uri)
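
The credentials are escaped because secret keys can contain characters such as `/` and `+` that would otherwise change how the URI is parsed. A small stdlib-only sketch of that step, using made-up credentials (`build_s3_uri` is a hypothetical helper, and no request is sent):

```ruby
require 'cgi'
require 'uri'

# Hypothetical helper: embed percent-escaped credentials in an s3:// URI.
def build_s3_uri(access_key, secret_key, bucket, path)
  URI("s3://#{CGI.escape(access_key)}:#{CGI.escape(secret_key)}@#{bucket}/#{path}")
end

# Made-up credentials; the secret deliberately contains '/' and '+'.
uri = build_s3_uri('EXAMPLEKEY', 'abc/def+ghi', 'bucket', 'data.parquet')

puts uri.host                    # the bucket is parsed as the host
puts uri.path                    # the object key is parsed as the path
puts CGI.unescape(uri.password)  # the original secret is recoverable
```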

Loading from HTTP

require 'net/http'

params = {
  query: "SELECT * FROM table LIMIT 10 FORMAT Arrow",
  user: "username",
  password: "password"
}
uri = URI('https://example.com/query')
uri.query = URI.encode_www_form(params)
resp = Net::HTTP.get(uri)

table = Arrow::Table.load(Arrow::Buffer.new(resp))
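
`URI.encode_www_form` takes care of escaping the SQL text so that spaces and other reserved characters survive in the query string. A stdlib-only sketch that builds the same kind of URL without sending a request (`build_query_uri` is a hypothetical helper; the endpoint and parameters are the placeholders used above):

```ruby
require 'uri'

# Hypothetical helper: attach form-encoded parameters to a base URL.
def build_query_uri(base_url, params)
  uri = URI(base_url)
  uri.query = URI.encode_www_form(params)
  uri
end

uri = build_query_uri(
  'https://example.com/query',
  query: 'SELECT * FROM table LIMIT 10 FORMAT Arrow',
  user: 'username'
)

puts uri                                 # full URL with the encoded query
p URI.decode_www_form(uri.query).to_h    # decoding restores the original values
```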

Saving Tables

# Save as Arrow IPC file
table.save('output.arrow')

# Save as CSV
table.save('output.csv', format: :csv)

# Save as Parquet
require 'parquet'
table.save('output.parquet', format: :parquet)

Accessing Data

Column Access

# Access column by name
age_column = table['age']
age_column = table[:age]

# Access column by index
first_column = table.columns[0]

# Get column names
table.column_names  # => ['name', 'age', 'salary']

# Get number of columns
table.n_columns  # => 3

Row Access

# Get number of rows
table.n_rows
table.size
table.length

# Access single row (returns Arrow::Record)
row = table.slice(0)
row['name']   # => 'Alice'
row['age']    # => 25

# Iterate over rows
table.each_record_batch do |record_batch|
  record_batch.each do |record|
    puts "#{record['name']}: #{record['age']}"
  end
end

Filtering Data

Using Slicer

Pass a block to slice to filter rows. The block receives a slicer, and comparisons on its columns build the filter condition:
table = Arrow::Table.new(
  'name' => ['Tom', 'Max', 'Kate'],
  'age' => [22, 23, 19]
)

# Simple condition
result = table.slice { |slicer| slicer['age'] > 19 }
# Returns rows where age > 19

# Range condition
result = table.slice { |slicer| slicer['age'].in?(19..22) }
# Returns rows where age is between 19 and 22

Combining Conditions

Use logical operators to combine filters:
# AND (&)
result = table.slice { |slicer|
  (slicer['age'] > 19) & (slicer['age'] < 23)
}

# OR (|)
result = table.slice { |slicer|
  (slicer['age'] < 20) | (slicer['age'] > 22)
}

# XOR (^)
result = table.slice { |slicer|
  (slicer['age'] < 21) ^ (slicer['name'] == 'Tom')
}
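
To see which rows each operator keeps, the same conditions can be modeled in plain Ruby on the three rows above. This sketch does not use the slicer; `select_names` is a hypothetical helper that applies an ordinary Ruby block to each row:

```ruby
# The same three rows as the table above, as plain hashes.
ROWS = [
  { 'name' => 'Tom',  'age' => 22 },
  { 'name' => 'Max',  'age' => 23 },
  { 'name' => 'Kate', 'age' => 19 }
].freeze

# Hypothetical helper: names of the rows for which the block is true.
def select_names(rows)
  rows.select { |row| yield(row) }.map { |row| row['name'] }
end

and_names = select_names(ROWS) { |r| (r['age'] > 19) & (r['age'] < 23) }
or_names  = select_names(ROWS) { |r| (r['age'] < 20) | (r['age'] > 22) }
xor_names = select_names(ROWS) { |r| (r['age'] < 21) ^ (r['name'] == 'Tom') }

p and_names  # => ["Tom"]
p or_names   # => ["Max", "Kate"]
p xor_names  # => ["Tom", "Kate"]
```

XOR keeps a row when exactly one side is true: Tom matches only the name test, Kate matches only the age test, and Max matches neither.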

Hash-based Filtering

# Filter by exact match
result = table.slice('name' => 'Tom')

# Filter by range
result = table.slice('age' => 20..25)

Array-based Filtering

# Boolean array
filter = [true, false, true]
result = table.slice(filter)

# Arrow BooleanArray
filter_array = Arrow::BooleanArray.new([true, false, true])
result = table.slice(filter_array)

Grouping and Aggregation

Perform group-by operations:
table = Arrow::Table.new(
  'name' => ['Tom', 'Max', 'Kate', 'Tom'],
  'amount' => [10, 2, 3, 5]
)

# Group and sum
result = table.group('name').sum('amount')
# Output:
#   name	amount
# 0 Kate	     3
# 1  Max	     2
# 2  Tom	    15

# Other aggregation functions
table.group('name').count('amount')
table.group('name').mean('amount')
table.group('name').min('amount')
table.group('name').max('amount')
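
As a cross-check on the sums above, the same aggregations can be modeled with plain Ruby `group_by`. This only illustrates what each aggregation computes per group, not how Red Arrow implements it; `aggregate_by_name` is a hypothetical helper:

```ruby
# Hypothetical helper: per-name statistics computed with plain Ruby.
def aggregate_by_name(names, amounts)
  # Pair each name with its amount, then bucket the pairs by name.
  groups = names.zip(amounts).group_by(&:first)
  groups.transform_values do |pairs|
    values = pairs.map(&:last)
    {
      sum: values.sum,
      count: values.size,
      mean: values.sum.to_f / values.size,
      min: values.min,
      max: values.max
    }
  end
end

stats = aggregate_by_name(['Tom', 'Max', 'Kate', 'Tom'], [10, 2, 3, 5])
stats.sort.each { |name, s| puts "#{name}: #{s}" }
```

Tom appears twice (10 and 5), which is why his sum is 15 while Kate and Max keep their single values.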

Joining Tables

Join tables using common keys:
amounts = Arrow::Table.new(
  'name' => ['Tom', 'Max', 'Kate'],
  'amount' => [10, 2, 3]
)

levels = Arrow::Table.new(
  'name' => ['Max', 'Kate', 'Tom'],
  'level' => [1, 9, 5]
)

# Inner join on the shared 'name' column
result = amounts.join(levels, [:name])
# Output:
#   name	amount	name	level
# 0  Tom	    10	 Tom	    5
# 1  Max	     2	 Max	    1
# 2 Kate	     3	Kate	    9

Join Types

# Inner join (default)
table1.join(table2, [:key], type: :inner)

# Left outer join
table1.join(table2, [:key], type: :left_outer)

# Right outer join
table1.join(table2, [:key], type: :right_outer)

# Full outer join
table1.join(table2, [:key], type: :full_outer)
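
The join types differ only in how rows without a matching key are handled. A plain-Ruby model of inner versus left outer joins on a single key follows; `inner_join` and `left_outer_join` are hypothetical helpers, rows are hashes rather than Arrow tables, and Tom is deliberately missing from the right side to make the difference visible:

```ruby
# Hypothetical helper: keep only left rows that have a match on the right.
def inner_join(left_rows, right_rows, key)
  right_by_key = right_rows.group_by { |row| row[key] }
  left_rows.flat_map do |left|
    (right_by_key[left[key]] || []).map { |right| left.merge(right) }
  end
end

# Hypothetical helper: keep every left row, filling gaps with nil.
def left_outer_join(left_rows, right_rows, key)
  right_by_key = right_rows.group_by { |row| row[key] }
  right_columns = right_rows.flat_map(&:keys).uniq - [key]
  left_rows.flat_map do |left|
    matches = right_by_key[left[key]]
    if matches
      matches.map { |right| left.merge(right) }
    else
      # No match: keep the left row and set the right-side columns to nil.
      [left.merge(right_columns.to_h { |column| [column, nil] })]
    end
  end
end

amounts = [{ name: 'Tom', amount: 10 }, { name: 'Max', amount: 2 }, { name: 'Kate', amount: 3 }]
levels  = [{ name: 'Max', level: 1 }, { name: 'Kate', level: 9 }]

p inner_join(amounts, levels, :name)       # Tom is dropped
p left_outer_join(amounts, levels, :name)  # Tom is kept with level: nil
```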

Different Key Names

# Join when key columns have different names
table1.join(table2, {left: 'user_id', right: 'id'})

Transforming Data

Adding Columns

# Merge adds or replaces columns
new_column = Arrow::Int64Array.new([100, 200, 300])
result = table.merge('score' => new_column)

Removing Columns

# Remove by name
result = table.remove_column('age')

# Remove by index
result = table.remove_column(1)

Slicing by Range

# Slice rows 2 to 4 (inclusive)
result = table.slice(2..4)

# Slice rows 2 to 4 (exclusive end)
result = table.slice(2...4)

# Slice with offset and length
result = table.slice(2, 3)  # 3 rows starting at index 2
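
Ruby's `Array#[]` uses the same conventions for inclusive ranges, exclusive ranges, and offset/length pairs as described in the comments above, so a plain array of row indices makes the selected rows easy to see (illustration only; no Red Arrow calls):

```ruby
# Six row indices standing in for a six-row table.
row_indices = (0..5).to_a

p row_indices[2..4]   # => [2, 3, 4]  inclusive end
p row_indices[2...4]  # => [2, 3]     exclusive end
p row_indices[2, 3]   # => [2, 3, 4]  offset 2, length 3
```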

Working with Compute Functions

Access Arrow’s compute functions directly:
# Find a compute function
add = Arrow::Function.find('add')

# Execute function
result = add.execute([table['age'].data, table['age'].data])
ages_doubled = result.value
Common functions:
  • Arithmetic: add, subtract, multiply, divide
  • Comparison: equal, greater, less, greater_equal, less_equal
  • String: utf8_length, starts_with, ends_with
  • Statistical: sum, mean, min, max, stddev

Reading and Writing Streams

Writing Streams

# Define schema
fields = [
  Arrow::Field.new('uint8', :uint8),
  Arrow::Field.new('uint16', :uint16),
  Arrow::Field.new('int32', :int32)
]
schema = Arrow::Schema.new(fields)

# Write to stream
Arrow::FileOutputStream.open('/tmp/stream.arrow', false) do |output|
  Arrow::RecordBatchStreamWriter.open(output, schema) do |writer|
    # Create record batches
    columns = [
      Arrow::UInt8Array.new([1, 2, 4, 8]),
      Arrow::UInt16Array.new([1, 2, 4, 8]),
      Arrow::Int32Array.new([1, -2, 4, -8])
    ]
    record_batch = Arrow::RecordBatch.new(schema, 4, columns)
    writer.write_record_batch(record_batch)
  end
end

Reading Streams

Arrow::MemoryMappedInputStream.open('/tmp/stream.arrow') do |input|
  reader = Arrow::RecordBatchStreamReader.new(input)
  fields = reader.schema.fields
  
  reader.each_with_index do |record_batch, i|
    puts "Record batch #{i}:"
    fields.each do |field|
      field_name = field.name
      values = record_batch.collect { |record| record[field_name] }
      puts "  #{field_name}: #{values.inspect}"
    end
  end
end

Memory Management

Packing Tables

Optimize memory layout by packing chunked arrays:
# Pack consolidates chunked arrays into contiguous memory
packed_table = table.pack

Memory-Mapped Files

Use memory mapping for efficient file access:
Arrow::MemoryMappedInputStream.open('/path/to/file.arrow') do |input|
  reader = Arrow::RecordBatchFileReader.new(input)
  table = reader.read_all
end

Type System

Red Arrow exposes Arrow's data types. Commonly used type names, grouped by category:

Numeric Types

:int8, :int16, :int32, :int64
:uint8, :uint16, :uint32, :uint64
:float, :double

String and Binary

:string, :binary
:large_string, :large_binary

Temporal Types

:date32, :date64
:time32, :time64
:timestamp
:duration

Other Types

:boolean
:decimal128, :decimal256
:list, :large_list, :fixed_size_list
:struct, :map
:dictionary

Best Practices

  1. Use explicit schemas for production code to ensure data consistency
  2. Pack tables when memory is constrained or before serialization
  3. Use memory-mapped I/O for large files
  4. Leverage columnar operations instead of row-by-row processing
  5. Batch operations when possible for better performance
  6. Close resources explicitly or use blocks for automatic cleanup
