Apache Arrow provides CUDA integration for GPU-accelerated data processing. The GPU support enables zero-copy data sharing between CPU and GPU, efficient memory management on CUDA devices, and integration with GPU-based compute libraries.
Overview
Arrow’s CUDA support provides:
- GPU memory management: Allocate and manage buffers on CUDA devices
- Zero-copy transfers: Share data between CPU and GPU without copying
- IPC support: Share GPU buffers between processes using CUDA IPC
- Device abstraction: Unified API for CPU and GPU memory
- Multi-GPU support: Work with multiple CUDA devices
Getting Started
Device Management
Access CUDA devices through the CudaDeviceManager:
#include <iostream>
#include "arrow/gpu/cuda_api.h"
// Get the device manager singleton
arrow::Result<arrow::cuda::CudaDeviceManager*> result =
arrow::cuda::CudaDeviceManager::Instance();
if (!result.ok()) {
std::cerr << "CUDA not available: " << result.status() << std::endl;
return;
}
auto manager = result.ValueOrDie();
// Get number of available GPUs
int num_devices = manager->num_devices();
std::cout << "Found " << num_devices << " CUDA device(s)" << std::endl;
// Get a specific device (device 0)
auto device_result = manager->GetDevice(0);
if (!device_result.ok()) {
std::cerr << "Failed to get device: " << device_result.status();
return;
}
std::shared_ptr<arrow::cuda::CudaDevice> device =
device_result.ValueOrDie();
std::cout << "Device: " << device->device_name() << std::endl;
std::cout << "Total memory: " << device->total_memory() << " bytes" << std::endl;
CUDA Context
A CudaContext manages the CUDA driver context for a device:
// Get context for device
auto context_result = device->GetContext();
std::shared_ptr<arrow::cuda::CudaContext> context =
context_result.ValueOrDie();
// Get device number
int device_num = context->device_number();
// Synchronize all operations on the device
context->Synchronize();
// Get memory usage
int64_t bytes_allocated = context->bytes_allocated();
GPU Memory Management
Allocating GPU Memory
Allocate memory on a CUDA device:
// Allocate 1 MB on GPU
int64_t size = 1024 * 1024;
auto buffer_result = context->Allocate(size);
if (!buffer_result.ok()) {
std::cerr << "Allocation failed: " << buffer_result.status();
return;
}
// The Result holds a move-only unique_ptr, so move it out
std::unique_ptr<arrow::cuda::CudaBuffer> gpu_buffer =
std::move(buffer_result).ValueOrDie();
std::cout << "Allocated " << gpu_buffer->size()
<< " bytes on GPU" << std::endl;
Copying Data Between CPU and GPU
// Create CPU buffer with data
std::vector<int32_t> cpu_data(1000, 42);
auto cpu_buffer = arrow::Buffer::Wrap(cpu_data);
// Allocate GPU buffer
auto gpu_buffer = context->Allocate(cpu_buffer->size()).ValueOrDie();
// Copy from CPU to GPU
arrow::Status status = gpu_buffer->CopyFromHost(
0, cpu_buffer->data(), cpu_buffer->size());
if (!status.ok()) {
std::cerr << "Copy to GPU failed: " << status;
}
// Allocate CPU buffer for results
std::vector<int32_t> result_data(1000);
// Copy from GPU back to CPU
status = gpu_buffer->CopyToHost(
0, gpu_buffer->size(), result_data.data());
if (!status.ok()) {
std::cerr << "Copy from GPU failed: " << status;
}
Viewing GPU Memory
Create non-owning views of existing GPU memory:
// Existing GPU allocation (e.g., from another library)
uint8_t* device_ptr = /* pointer to GPU memory */;
int64_t size = /* size of allocation */;
// Create Arrow buffer view
auto view_result = context->View(device_ptr, size);
std::shared_ptr<arrow::cuda::CudaBuffer> buffer_view =
view_result.ValueOrDie();
// Use view without taking ownership
// Original owner is responsible for freeing memory
Host Memory with GPU Access
Allocate pinned CPU memory accessible by GPU:
// Allocate pinned host memory
int64_t size = 1024 * 1024;
auto host_buffer_result = device->AllocateHostBuffer(size);
std::shared_ptr<arrow::cuda::CudaHostBuffer> host_buffer =
host_buffer_result.ValueOrDie();
// Get device address for GPU access
auto device_addr_result = host_buffer->GetDeviceAddress(context);
uintptr_t device_addr = device_addr_result.ValueOrDie();
// GPU can access this address directly
// Enables zero-copy transfers in some cases
Memory Manager Integration
Use Arrow’s unified memory manager API:
// Get memory manager for device
std::shared_ptr<arrow::MemoryManager> mm =
device->default_memory_manager();
// Check if it's a CUDA memory manager
if (arrow::cuda::IsCudaMemoryManager(*mm)) {
auto cuda_mm = arrow::cuda::AsCudaMemoryManager(mm).ValueOrDie();
auto cuda_device = cuda_mm->cuda_device();
std::cout << "Using CUDA device: "
<< cuda_device->device_number() << std::endl;
}
// Allocate through memory manager
auto buffer_result = mm->AllocateBuffer(1024 * 1024);
// The Result holds a move-only unique_ptr, so move it out
std::unique_ptr<arrow::Buffer> buffer = std::move(buffer_result).ValueOrDie();
Multi-GPU Operations
Copying Between GPUs
// Get two different devices
auto device0 = manager->GetDevice(0).ValueOrDie();
auto device1 = manager->GetDevice(1).ValueOrDie();
auto context0 = device0->GetContext().ValueOrDie();
auto context1 = device1->GetContext().ValueOrDie();
// Allocate on first GPU
auto buffer0 = context0->Allocate(1024).ValueOrDie();
// Allocate on second GPU
auto buffer1 = context1->Allocate(1024).ValueOrDie();
// Copy from GPU 0 to GPU 1
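// address() returns a uintptr_t, so cast it to the pointer the API expects
arrow::Status status = buffer1->CopyFromAnotherDevice(
context0, 0, reinterpret_cast<const void*>(buffer0->address()),
buffer0->size());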
CUDA IPC (Inter-Process Communication)
Share GPU buffers between processes:
// Process 1: Export buffer for sharing
auto gpu_buffer = context->Allocate(1024 * 1024).ValueOrDie();
// Get IPC handle
auto handle_result = gpu_buffer->ExportForIpc();
std::shared_ptr<arrow::cuda::CudaIpcMemHandle> ipc_handle =
handle_result.ValueOrDie();
// Serialize handle to send to other process
auto serialized = ipc_handle->Serialize().ValueOrDie();
// Send serialized buffer to other process...
// (e.g., via sockets, shared memory, etc.)
// Process 2: Open shared buffer
// Receive serialized handle from other process...
const void* handle_data = /* received data */;
auto handle_result =
arrow::cuda::CudaIpcMemHandle::FromBuffer(handle_data);
auto ipc_handle = handle_result.ValueOrDie();
// Open the shared buffer
auto buffer_result = context->OpenIpcBuffer(*ipc_handle);
std::shared_ptr<arrow::cuda::CudaBuffer> shared_buffer =
buffer_result.ValueOrDie();
// Access shared data
// ...
// Close when done
context->CloseIpcBuffer(shared_buffer.get());
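The transport for the serialized handle is left open above. As one possible approach, the sketch below sends an opaque byte blob over a POSIX file descriptor (a pipe or a connected Unix socket) with a length prefix. None of these helpers are part of Arrow: `WriteAll`, `ReadAll`, `SendFrame`, `ReceiveFrame`, and the `kMaxFrameBytes` limit are hypothetical names chosen for illustration, and the code assumes both processes run on the same machine, so the length is written in host byte order.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <unistd.h>
#include <vector>

// Assumed upper bound for a frame; serialized IPC handles are small.
constexpr int64_t kMaxFrameBytes = 4096;

// Write exactly nbytes to fd, retrying on short writes and EINTR.
inline bool WriteAll(int fd, const void* data, std::size_t nbytes) {
  const auto* p = static_cast<const uint8_t*>(data);
  while (nbytes > 0) {
    const ssize_t n = ::write(fd, p, nbytes);
    if (n < 0 && errno == EINTR) continue;
    if (n <= 0) return false;
    p += n;
    nbytes -= static_cast<std::size_t>(n);
  }
  return true;
}

// Read exactly nbytes from fd; returns false on error or early EOF.
inline bool ReadAll(int fd, void* out, std::size_t nbytes) {
  auto* p = static_cast<uint8_t*>(out);
  while (nbytes > 0) {
    const ssize_t n = ::read(fd, p, nbytes);
    if (n < 0 && errno == EINTR) continue;
    if (n <= 0) return false;
    p += n;
    nbytes -= static_cast<std::size_t>(n);
  }
  return true;
}

// Frame layout: 8-byte length (host byte order) followed by the payload.
inline bool SendFrame(int fd, const uint8_t* data, int64_t size) {
  if (size < 0 || size > kMaxFrameBytes) return false;
  return WriteAll(fd, &size, sizeof(size)) &&
         WriteAll(fd, data, static_cast<std::size_t>(size));
}

// Receive one frame into *out, rejecting lengths outside the assumed bound.
inline bool ReceiveFrame(int fd, std::vector<uint8_t>* out) {
  int64_t size = 0;
  if (!ReadAll(fd, &size, sizeof(size))) return false;
  if (size < 0 || size > kMaxFrameBytes) return false;
  out->resize(static_cast<std::size_t>(size));
  return ReadAll(fd, out->data(), out->size());
}
```

In the exporting process this could be called as `SendFrame(fd, serialized->data(), serialized->size())`. The importing process would then pass the received bytes' `data()` pointer to `CudaIpcMemHandle::FromBuffer`.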
Streams and Events
CUDA Streams
// Create a CUDA stream
auto stream_result = device->MakeStream();
std::shared_ptr<arrow::Device::Stream> stream =
stream_result.ValueOrDie();
// Synchronize stream
stream->Synchronize();
// Wrap existing stream
CUstream cu_stream = /* existing CUDA stream */;
auto wrapped_stream = device->WrapStream(
&cu_stream,
/*release_fn=*/nullptr // Don't free on destroy
).ValueOrDie();
CUDA Events
// Get CUDA memory manager
auto cuda_mm = arrow::cuda::AsCudaMemoryManager(
device->default_memory_manager()).ValueOrDie();
// Create synchronization event
auto event_result = cuda_mm->MakeDeviceSyncEvent();
std::shared_ptr<arrow::Device::SyncEvent> event =
event_result.ValueOrDie();
// Record event on stream
event->Record(*stream);
// Wait for event
event->Wait();
// Wait on different stream (stream2 is a second stream, e.g. from device->MakeStream())
stream2->WaitEvent(*event);
Buffer I/O
Read and write GPU buffers using file-like interfaces:
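// Create reader for GPU buffer
// (gpu_buffer must be a std::shared_ptr<arrow::Buffer>; convert a
// unique_ptr returned by Allocate() with std::move first)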
auto reader_result = mm->GetBufferReader(gpu_buffer);
std::shared_ptr<arrow::io::RandomAccessFile> reader =
reader_result.ValueOrDie();
// Read data (copies to host)
std::vector<uint8_t> host_data(100);
auto read_result = reader->Read(100, host_data.data());
// Create writer for GPU buffer
auto writer_result = mm->GetBufferWriter(gpu_buffer);
std::shared_ptr<arrow::io::OutputStream> writer =
writer_result.ValueOrDie();
// Write data (copies from host)
std::vector<uint8_t> data_to_write = {1, 2, 3, 4, 5};
writer->Write(data_to_write.data(), data_to_write.size());
Minimize CPU-GPU Transfers
// BAD: Multiple small transfers
for (int i = 0; i < 1000; ++i) {
gpu_buffer->CopyFromHost(i * sizeof(int), &data[i], sizeof(int));
}
// GOOD: Single large transfer
gpu_buffer->CopyFromHost(0, data.data(), data.size() * sizeof(int));
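When values arrive one at a time, the single-transfer pattern above can still be kept by staging them in host memory and flushing once. The sketch below is an illustration, not an Arrow API: `StagedUploader` and its `CopyFn` parameter are hypothetical names. `CopyFn` stands for any callable that takes `(position, data, nbytes)` in the same order as `CudaBuffer::CopyFromHost`, which lets the example compile without CUDA.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow class): accumulates small writes in a
// host-side staging vector and issues one copy per Flush().
class StagedUploader {
 public:
  // copy_fn(position, data, nbytes) mirrors the argument order of
  // CudaBuffer::CopyFromHost and returns true on success.
  using CopyFn = std::function<bool(int64_t, const void*, int64_t)>;

  explicit StagedUploader(CopyFn copy_fn) : copy_fn_(std::move(copy_fn)) {}

  // Append raw bytes to the staging area; no device transfer happens here.
  void Append(const void* data, std::size_t nbytes) {
    const auto* bytes = static_cast<const uint8_t*>(data);
    staging_.insert(staging_.end(), bytes, bytes + nbytes);
  }

  // Send everything staged so far in one copy, placed directly after the
  // bytes written by earlier flushes.
  bool Flush() {
    if (staging_.empty()) return true;
    const auto nbytes = static_cast<int64_t>(staging_.size());
    if (!copy_fn_(device_offset_, staging_.data(), nbytes)) return false;
    device_offset_ += nbytes;
    staging_.clear();
    return true;
  }

  // Number of bytes already transferred by successful flushes.
  int64_t device_offset() const { return device_offset_; }

 private:
  CopyFn copy_fn_;
  std::vector<uint8_t> staging_;
  int64_t device_offset_ = 0;
};
```

With a real buffer, the callable might be a lambda that returns `gpu_buffer->CopyFromHost(pos, data, n).ok()`. The caller remains responsible for making sure the device buffer is large enough for everything that gets flushed.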
Use Pinned Memory for Frequent Transfers
// Allocate pinned memory once
auto host_buffer = device->AllocateHostBuffer(size).ValueOrDie();
// Reuse for multiple transfers (faster than pageable memory)
for (const auto& batch : batches) {
// Copy to pinned memory
std::memcpy(host_buffer->mutable_data(),
batch.data(), batch.size());
// Transfer only the bytes just staged (assumes batch.size() <= size);
// the copy is faster from pinned memory
gpu_buffer->CopyFromHost(0, host_buffer->data(), batch.size());
}
Asynchronous Operations
// Use streams for overlapping operations
auto stream1 = device->MakeStream().ValueOrDie();
auto stream2 = device->MakeStream().ValueOrDie();
// Launch operations on different streams (can execute concurrently).
// LaunchKernel1 and LaunchKernel2 are placeholders for your own kernel launchers.
LaunchKernel1(stream1);
LaunchKernel2(stream2);
stream1->Synchronize();
stream2->Synchronize();
When to Use GPU Support
GPU support is beneficial for:
- Large-scale compute: Operations on hundreds of MBs to GBs of data
- Parallel algorithms: Data-parallel operations that map well to GPU architecture
- Integration with GPU libraries: Using CUDA-based ML/DL frameworks
- Minimizing copies: Zero-copy sharing with GPU compute engines
CPU may be better for:
- Small datasets: GPU transfer overhead dominates
- Sequential operations: Limited parallelism opportunity
- Complex control flow: GPUs excel at data-parallel, not task-parallel workloads
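The trade-off in the two lists above can be estimated with a rough cost model before committing to a GPU code path. The sketch below is not part of Arrow: `CostModel`, `EstimateCpuSeconds`, `EstimateGpuSeconds`, and `GpuIsWorthwhile` are hypothetical names, and the model assumes the result is as large as the input and that transfers do not overlap with compute. The throughput figures have to be measured on the target machine.

```cpp
#include <cstdint>

// Hypothetical cost model; none of these names come from Arrow.
// All throughputs are in bytes per second.
struct CostModel {
  double cpu_bytes_per_sec;       // CPU processing throughput
  double gpu_bytes_per_sec;       // GPU processing throughput (kernel only)
  double transfer_bytes_per_sec;  // host <-> device copy bandwidth
  double fixed_overhead_sec;      // launch and synchronization latency
};

// Estimated seconds to process nbytes entirely on the CPU.
inline double EstimateCpuSeconds(const CostModel& m, int64_t nbytes) {
  return static_cast<double>(nbytes) / m.cpu_bytes_per_sec;
}

// Estimated seconds on the GPU: fixed overhead, copy in and copy out
// (two transfers of nbytes each), plus the kernel time.
inline double EstimateGpuSeconds(const CostModel& m, int64_t nbytes) {
  const double n = static_cast<double>(nbytes);
  return m.fixed_overhead_sec + 2.0 * n / m.transfer_bytes_per_sec +
         n / m.gpu_bytes_per_sec;
}

// True when the modeled GPU path, including transfers, beats the CPU path.
inline bool GpuIsWorthwhile(const CostModel& m, int64_t nbytes) {
  return EstimateGpuSeconds(m, nbytes) < EstimateCpuSeconds(m, nbytes);
}
```

For example, with assumed values of 1 GB/s CPU throughput, 20 GB/s GPU throughput, 10 GB/s transfer bandwidth, and 100 microseconds of fixed overhead, the model keeps a 1 KiB input on the CPU and sends a 1 GiB input to the GPU. Real workloads should be benchmarked rather than trusted to a model this simple.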
Prerequisites
- NVIDIA GPU with CUDA support (compute capability 3.0+)
- CUDA toolkit installed (version 9.0 or later)
- Arrow built with the ARROW_CUDA=ON CMake option
- Appropriate GPU drivers installed
Limitations
- CUDA IPC only works between processes on the same physical machine
- IPC handles are only valid on the machine that exported them; a serialized handle sent over the network to another host cannot be opened there
- GPU memory is limited; monitor allocation carefully
- Not all Arrow operations have GPU implementations
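Because GPU memory is limited, large inputs are often processed in chunks that fit a fixed budget. The sketch below is a hypothetical planner, not an Arrow API: `Chunk` and `PlanChunks` are illustrative names. In practice the budget might be derived from `device->total_memory()` minus `context->bytes_allocated()`, but that is an assumption, since other contexts and processes may also hold memory on the same device.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A byte range of the input to upload and process in one pass.
struct Chunk {
  int64_t offset;
  int64_t length;
};

// Split [0, total_bytes) into consecutive chunks of at most budget_bytes,
// rounded down to a multiple of element_size so no element is split.
// Returns an empty plan if the budget cannot hold a single element.
inline std::vector<Chunk> PlanChunks(int64_t total_bytes, int64_t budget_bytes,
                                     int64_t element_size) {
  std::vector<Chunk> plan;
  if (total_bytes <= 0 || element_size <= 0) return plan;
  const int64_t chunk_bytes = (budget_bytes / element_size) * element_size;
  if (chunk_bytes <= 0) return plan;
  for (int64_t offset = 0; offset < total_bytes; offset += chunk_bytes) {
    plan.push_back({offset, std::min(chunk_bytes, total_bytes - offset)});
  }
  return plan;
}
```

Each chunk can then reuse a single device buffer of `chunk_bytes`, copying `length` bytes starting at `offset` from the host data on every pass.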
Error Handling
CUDA operations that can fail return arrow::Result or arrow::Status:
auto result = context->Allocate(size);
if (!result.ok()) {
std::cerr << "Allocation failed: " << result.status().ToString();
// Handle error...
return;
}
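// The Result holds a move-only unique_ptr, so move it out
auto buffer = std::move(result).ValueOrDie();
// Or, inside a function that returns arrow::Status or arrow::Result,
// use the ARROW_ASSIGN_OR_RAISE macro, which returns the error to the caller
ARROW_ASSIGN_OR_RAISE(auto buffer, context->Allocate(size));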