Python Quickstart

This guide will get you up and running with PyArrow quickly. You’ll learn how to create arrays, build tables, perform computations, and work with data files.

Prerequisites

You’ll need:

Python 3.10 or higher
pip or conda package manager
Basic familiarity with Python (pandas experience is optional)

Install PyArrow

pip:

pip install pyarrow

conda:

conda install -c conda-forge pyarrow

With pandas:

pip install pyarrow pandas

Verify the installation:

import pyarrow as pa
print(pa.__version__)

Create Your First Arrays

Arrays are the fundamental data structure in Arrow - homogeneous, typed collections of data.

import pyarrow as pa

# Create arrays from Python lists
days = pa.array([1, 12, 17, 23, 28], type=pa.int8())
print(days)
# Output: [1, 12, 17, 23, 28]

# Create arrays with different types
names = pa.array(["Alice", "Bob", "Carol"])
scores = pa.array([95.5, 87.3, 92.1], type=pa.float64())

# Arrays with null values
optional_data = pa.array([1, None, 3, None, 5])
print(optional_data)
# Output: [1, null, 3, null, 5]

Key points:

Arrays are immutable after creation
Each array has a single data type
Null values are supported natively
Arrow uses efficient columnar memory layout

Build Tables from Arrays

Tables organize multiple arrays into named columns, much like a pandas DataFrame, but stored in Arrow's typed, columnar memory format.

import pyarrow as pa

# Create arrays for each column
days = pa.array([1, 12, 17, 23, 28], type=pa.int8())
months = pa.array([1, 3, 5, 7, 1], type=pa.int8())
years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())

# Build table with named columns
birthdays_table = pa.table(
    [days, months, years],
    names=["days", "months", "years"]
)

print(birthdays_table)

Output:

pyarrow.Table
days: int8
months: int8
years: int16
----
days: [[1,12,17,23,28]]
months: [[1,3,5,7,1]]
years: [[1990,2000,1995,2000,1995]]

Access table data:

# Get column by name
print(birthdays_table["years"])

# Get column by index
print(birthdays_table[0])

# Table metadata
print(f"Rows: {birthdays_table.num_rows}")
print(f"Columns: {birthdays_table.num_columns}")
print(f"Schema: {birthdays_table.schema}")

Write and Read Parquet Files

Parquet is a widely used file format for Arrow data: it is columnar, compressed, and efficient to read selectively.

import pyarrow as pa
import pyarrow.parquet as pq

# Create a table
days = pa.array([1, 12, 17, 23, 28], type=pa.int8())
months = pa.array([1, 3, 5, 7, 1], type=pa.int8())
years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())

birthdays_table = pa.table(
    [days, months, years],
    names=["days", "months", "years"]
)

# Write to Parquet file
pq.write_table(birthdays_table, 'birthdays.parquet')
print("Wrote birthdays.parquet")

# Read from Parquet file
reloaded_table = pq.read_table('birthdays.parquet')
print("\nRead table:")
print(reloaded_table)

# Read specific columns only
days_only = pq.read_table('birthdays.parquet', columns=['days'])
print("\nRead only 'days' column:")
print(days_only)

Why Parquet?

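Columnar format = faster queries that touch only some columns
Built-in compression = smaller files
Preserves Arrow types on round trip in most cases (PyArrow stores the Arrow schema in the file metadata)
Widely supported across analytics tools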

Perform Computations

Arrow provides a rich set of compute functions for data processing.

import pyarrow as pa
import pyarrow.compute as pc

# Create sample data
ages = pa.array([25, 30, 35, 40, 45, 30, 35])
cities = pa.array(["NYC", "SF", "LA", "NYC", "SF", "NYC", "LA"])

# Statistical functions
print(f"Mean age: {pc.mean(ages).as_py()}")
print(f"Min age: {pc.min(ages).as_py()}")
print(f"Max age: {pc.max(ages).as_py()}")
print(f"Sum: {pc.sum(ages).as_py()}")

# Value counts
counts = pc.value_counts(cities)
print(f"\nCity counts: {counts}")

# Filtering
young = pc.less(ages, 35)
print(f"\nAges < 35: {pc.filter(ages, young)}")

# Arithmetic
ages_in_months = pc.multiply(ages, 12)
print(f"\nAges in months: {ages_in_months}")

Common compute functions:

Math: add, subtract, multiply, divide
Stats: mean, stddev, variance, min, max
Strings: utf8_upper, utf8_lower, split_pattern
Comparisons: equal, less, greater
Aggregations: sum, count, value_counts

Work with Large Datasets

For data that doesn’t fit in memory, use the Dataset API with partitioning.

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Create sample data
data = pa.table({
    "year": [2020, 2020, 2021, 2021, 2022, 2022],
    "month": [1, 2, 1, 2, 1, 2],
    "revenue": [100, 150, 120, 180, 140, 200]
})

# Write partitioned dataset
ds.write_dataset(
    data,
    "revenue_data",
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("year", pa.int64())])
    )
)

print("Wrote partitioned dataset to revenue_data/")

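# Open dataset (lazy - doesn't load all data).
# Pass the same partitioning used for writing so the "year" column
# is reconstructed from the directory names.
dataset = ds.dataset(
    "revenue_data",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int64())]))
)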

print(f"\nDataset files: {dataset.files}")

# Query with filtering (only reads relevant partitions)
result = dataset.to_table(
    filter=pc.field("year") == 2021
)

print("\nFiltered data (year=2021):")
print(result)

# Scan the dataset one record batch at a time
for batch in dataset.to_batches():
    print(f"Batch: {batch.num_rows} rows")

Dataset benefits:

Works with data larger than memory
Partition pruning for fast queries
Reads only needed columns and partitions
Supports multiple file formats

Convert to/from Pandas

PyArrow integrates seamlessly with pandas for easy data exchange.

import pyarrow as pa
import pandas as pd

# Create pandas DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [30, 25, 35],
    'city': ['NYC', 'SF', 'LA']
})

print("Pandas DataFrame:")
print(df)

# Convert to Arrow Table (zero-copy when possible)
table = pa.Table.from_pandas(df)
print("\nArrow Table:")
print(table)

# Convert back to pandas
df_back = table.to_pandas()
print("\nBack to pandas:")
print(df_back)

# Convert string columns to pandas Categorical (saves memory when values repeat)
df_strings = table.to_pandas(strings_to_categorical=True)
print("\nWith categorical strings:")
print(df_strings.dtypes)

Why use Arrow with pandas?

Faster I/O (especially Parquet)
Better memory efficiency
Explicit, typed schemas with native null support
Zero-copy conversion in some cases (for example, numeric columns without nulls)

Complete Example

Here’s a complete workflow that demonstrates common Arrow operations:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc

# 1. Create data
data = pa.table({
    'product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'quantity': [10, 20, 15, 5, 25, 12],
    'price': [100.0, 200.0, 100.0, 150.0, 200.0, 100.0]
})

print("Original data:")
print(data)

# 2. Compute total value
revenue = pc.multiply(data['quantity'], data['price'])
data = data.append_column('revenue', revenue)

print("\nWith revenue column:")
print(data)

# 3. Filter for high-value items
high_value = pc.greater(data['revenue'], 1500)
filtered = pc.filter(data, high_value)

print("\nHigh-value items (revenue > 1500):")
print(filtered)

# 4. Save to Parquet
pq.write_table(data, 'sales.parquet')
print("\nSaved to sales.parquet")

# 5. Read and analyze
loaded = pq.read_table('sales.parquet')
total_revenue = pc.sum(loaded['revenue']).as_py()
print(f"\nTotal revenue: ${total_revenue:,.2f}")

# 6. Group by product (using unique + filter)
unique_products = pc.unique(loaded['product']).to_pylist()
for product in unique_products:
    mask = pc.equal(loaded['product'], product)
    product_data = pc.filter(loaded, mask)
    product_revenue = pc.sum(product_data['revenue']).as_py()
    print(f"{product}: ${product_revenue:,.2f}")

Next Steps

Compute Functions: Explore all available compute functions
CSV Files: Fast CSV reading and writing
Working with Pandas: Deep integration with pandas
API Reference: Complete PyArrow API documentation

Common Patterns

Reading Large CSV Files

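import pyarrow.csv as csv
import pyarrow.dataset as ds

# read_csv loads the whole file into memory
table = csv.read_csv('large_file.csv')

# For files larger than memory, stream record batches with open_csv
reader = csv.open_csv('large_file.csv')
for batch in reader:
    pass  # Process each batch

# Or use the dataset API for multiple files
dataset = ds.dataset('data/', format='csv')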

Working with Schemas

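import pyarrow as pa

# Define explicit schema
schema = pa.schema([
    ('name', pa.string()),
    ('age', pa.int64()),
    ('balance', pa.float64())
])

# Example data matching the schema (illustrative values)
data = {
    'name': ['Alice', 'Bob'],
    'age': [30, 25],
    'balance': [100.0, 250.5]
}

# Create table with schema
table = pa.table(data, schema=schema)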

Handling Nested Data

import pyarrow as pa

# Create nested array
nested = pa.array([
    [1, 2, 3],
    [4, 5],
    [6, 7, 8, 9]
])

# Create struct array
structs = pa.array([
    {'name': 'Alice', 'age': 30},
    {'name': 'Bob', 'age': 25}
])

Performance Tips

Use columnar operations

Arrow is optimized for columnar operations. Process entire columns at once instead of row-by-row:

import pyarrow.compute as pc

# Good: columnar
result = pc.multiply(table['quantity'], table['price'])

# Avoid: row-by-row
result = [row['quantity'] * row['price'] for row in table.to_pylist()]

Read only needed columns

When reading Parquet, specify only the columns you need:

import pyarrow.parquet as pq

table = pq.read_table('data.parquet', columns=['name', 'age'])

Use dataset API for large data

For data larger than memory, use datasets with filtering:

import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset('large_data/', format='parquet')
result = dataset.to_table(filter=pc.field('year') == 2023)

Batch processing

Process data in batches to control memory usage:

for batch in dataset.to_batches(batch_size=10000):
    # Process batch
    pass

Troubleshooting

ImportError: No module named 'pyarrow'

Make sure PyArrow is installed in your current Python environment:

pip install pyarrow
# or
conda install -c conda-forge pyarrow

Schema mismatch errors

When appending or combining tables, ensure schemas match:

print(table1.schema)
print(table2.schema)
# Cast if needed
table2 = table2.cast(table1.schema)

Memory issues with large files

Use the dataset API instead of loading entire files:

import pyarrow.dataset as ds

# Instead of pq.read_table()
dataset = ds.dataset('file.parquet', format='parquet')
for batch in dataset.to_batches():
    process(batch)  # process() is a placeholder for your own function
