This guide covers debugging techniques, tools, and best practices for Apache Arrow development across C++, Python, and R implementations.

Build Configuration for Debugging

Proper build configuration is essential for effective debugging.

Debug Build

Build Arrow C++ in debug mode to get unoptimized code with full debug symbols:
cd arrow/cpp
mkdir build-debug
cd build-debug
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DARROW_BUILD_TESTS=ON \
  -DARROW_EXTRA_ERROR_CONTEXT=ON
make -j8
Key options:
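  • CMAKE_BUILD_TYPE=Debug: disables optimizations and enables debug symbols
  • ARROW_BUILD_TESTS=ON: builds the unit test executables, such as arrow-array-test
  • ARROW_EXTRA_ERROR_CONTEXT=ON: adds extra context information to error messages
Debug builds run noticeably slower than release builds, so use them for debugging rather than for performance measurements.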

RelWithDebInfo Build

For debugging with some optimizations:
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
This keeps debug symbols while leaving optimizations enabled. Expect some variables to be reported as optimized out and stepping to jump between source lines.

Compiler Warning Level

Set BUILD_WARNING_LEVEL=CHECKIN for stricter warnings:
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DBUILD_WARNING_LEVEL=CHECKIN
With GCC or Clang, this adds -Werror (treat warnings as errors); with MSVC, it adds /WX.

Debugging Tools

GDB (GNU Debugger)

GDB is the primary debugger for C++ on Linux.

1. Launch program in GDB

gdb --args ./build-debug/arrow-array-test

2. Set breakpoints

# Break at function
(gdb) break arrow::Array::Validate

# Break at file:line
(gdb) break array.cc:123

# Conditional breakpoint
(gdb) break array.cc:123 if length > 1000

3. Run the program

(gdb) run

4. Navigate execution

(gdb) next      # Step over
(gdb) step      # Step into
(gdb) continue  # Continue to next breakpoint
(gdb) finish    # Run until function returns

5. Inspect variables

(gdb) print variable_name
(gdb) print *pointer
(gdb) print array->length()

6. View backtrace

(gdb) backtrace
(gdb) bt full  # With local variables
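
The interactive steps above can also be scripted when a crash needs to be captured unattended, for example in CI. The following is a minimal sketch, not an Arrow-provided tool: the helper name gdb_batch_command is hypothetical, and it relies only on GDB's standard -batch, -ex, and --args options. It turns on pending breakpoints so that breakpoints on symbols in shared libraries that are not loaded yet are still accepted.

```python
import shlex


def gdb_batch_command(program, breakpoints=(), program_args=()):
    """Build an argv list that runs `program` under GDB non-interactively.

    GDB runs the program, prints a backtrace at the first stop
    (a breakpoint or a crash), and then exits.
    """
    argv = ["gdb", "-batch"]
    if breakpoints:
        # Accept breakpoints on symbols from libraries that load later.
        argv += ["-ex", "set breakpoint pending on"]
    for location in breakpoints:
        argv += ["-ex", f"break {location}"]
    argv += ["-ex", "run", "-ex", "backtrace"]
    argv += ["--args", program, *program_args]
    return argv


cmd = gdb_batch_command(
    "./build-debug/arrow-array-test",
    breakpoints=["arrow::Array::Validate"],
    program_args=["--gtest_filter=TestArray.TestBasics"],
)
# Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Replacing "backtrace" with "bt full" in the helper would include local variables in the captured output.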

LLDB (LLVM Debugger)

LLDB is the primary debugger on macOS and an alternative on Linux.
# Launch in LLDB
lldb -- ./build-debug/arrow-array-test

# Set breakpoint
(lldb) breakpoint set --name arrow::Array::Validate
(lldb) b array.cc:123

# Run
(lldb) run

# Navigate
(lldb) next
(lldb) step
(lldb) continue

# Inspect
(lldb) print variable_name
(lldb) frame variable

# Backtrace
(lldb) bt

Visual Studio Code

VS Code can debug Arrow C++ test binaries through the C/C++ extension:

1. Install C++ extension

Install the “C/C++” extension by Microsoft.

2. Create launch configuration

Create .vscode/launch.json:
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug Arrow Test",
      "type": "cppdbg",
      "request": "launch",
      "program": "${workspaceFolder}/cpp/build-debug/arrow-array-test",
      "args": [],
      "stopAtEntry": false,
      "cwd": "${workspaceFolder}",
      "environment": [],
      "MIMode": "gdb"
    }
  ]
}

3. Set breakpoints and debug

Click in the gutter next to line numbers to set breakpoints, then press F5 to start debugging.
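
The launch configuration above hard-codes GDB. On macOS, where LLDB is the usual debugger, the cppdbg adapter accepts "MIMode": "lldb" instead. The sketch below generates the same configuration with the debugger chosen per platform. The launch_config helper is hypothetical, and the platform rule (LLDB on macOS, GDB elsewhere) is an assumption to adjust for your setup.

```python
import json
import sys


def launch_config(program, mi_mode=None):
    """Return a launch.json document for the cppdbg adapter."""
    if mi_mode is None:
        # Assumption: LLDB on macOS, GDB everywhere else.
        mi_mode = "lldb" if sys.platform == "darwin" else "gdb"
    return {
        "version": "0.2.0",
        "configurations": [
            {
                "name": "Debug Arrow Test",
                "type": "cppdbg",
                "request": "launch",
                "program": program,
                "args": [],
                "stopAtEntry": False,
                "cwd": "${workspaceFolder}",
                "environment": [],
                "MIMode": mi_mode,
            }
        ],
    }


config = launch_config("${workspaceFolder}/cpp/build-debug/arrow-array-test")
# Write this output to .vscode/launch.json.
print(json.dumps(config, indent=2))
```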

Python Debugging

Built-in Python debugger:
import pyarrow as pa
import pdb

# Set breakpoint
pdb.set_trace()

# Or use breakpoint() in Python 3.7+
breakpoint()

array = pa.array([1, 2, 3])
Common pdb commands:
n       # Next line
s       # Step into
c       # Continue
l       # List code
p var   # Print variable
w       # Where (show stack trace)
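
breakpoint() does not call pdb directly: it dispatches to sys.breakpointhook, whose default implementation starts pdb and honors the PYTHONBREAKPOINT environment variable (PYTHONBREAKPOINT=0 turns every breakpoint() call into a no-op). The sketch below swaps in a hypothetical recording hook to show that dispatch without opening an interactive session.

```python
import sys

calls = []


def recording_hook(*args, **kwargs):
    """Stand-in for pdb: record that a breakpoint was reached."""
    calls.append("breakpoint reached")


original_hook = sys.breakpointhook
sys.breakpointhook = recording_hook
try:
    breakpoint()  # Dispatches to recording_hook instead of starting pdb.
finally:
    # Restore the previous hook so later breakpoint() calls behave normally.
    sys.breakpointhook = original_hook

print(calls)
```

The same mechanism lets you leave breakpoint() calls in place while running a test suite non-interactively, by exporting PYTHONBREAKPOINT=0 for that run.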

Sanitizers

Sanitizers detect various types of bugs at runtime.

Address Sanitizer (ASan)

Detects memory errors like use-after-free, buffer overflows, and memory leaks.
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DARROW_USE_ASAN=ON
make -j8

# Run tests
export ASAN_OPTIONS=detect_leaks=1
./build-debug/arrow-array-test

Undefined Behavior Sanitizer (UBSan)

Detects undefined behavior such as signed integer overflow and null pointer dereference.
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DARROW_USE_UBSAN=ON
make -j8

Thread Sanitizer (TSan)

Detects data races and thread-related issues.
cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DARROW_USE_TSAN=ON
make -j8
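Don’t combine ThreadSanitizer with AddressSanitizer in the same build: compilers reject that pairing. ASan and UBSan can share a build, but every sanitizer adds runtime and memory overhead, so enable only the ones you need.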
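
Each sanitizer runtime reads its settings from its own environment variable (ASAN_OPTIONS, UBSAN_OPTIONS, TSAN_OPTIONS) as colon-separated key=value pairs. The following is a minimal sketch of a test-runner helper that prepares such an environment. The helper names are hypothetical and the chosen option values are only one reasonable starting point. Setting all three variables is harmless because a binary only consults the variable for the sanitizer it was built with.

```python
import os


def format_options(options):
    """Join sanitizer options as colon-separated key=value pairs."""
    return ":".join(f"{key}={value}" for key, value in options.items())


def sanitizer_env(base_env=None):
    """Return a copy of the environment with sanitizer options set."""
    env = dict(os.environ if base_env is None else base_env)
    # Report leaks at exit and abort (producing a core dump) on the first error.
    env["ASAN_OPTIONS"] = format_options({"detect_leaks": 1, "abort_on_error": 1})
    # Print a stack trace for each UB report and stop at the first one.
    env["UBSAN_OPTIONS"] = format_options({"print_stacktrace": 1, "halt_on_error": 1})
    # Stop at the first reported data race.
    env["TSAN_OPTIONS"] = format_options({"halt_on_error": 1})
    return env


env = sanitizer_env({})
for name in ("ASAN_OPTIONS", "UBSAN_OPTIONS", "TSAN_OPTIONS"):
    print(f"{name}={env[name]}")
```

To run a sanitizer-enabled test binary with these settings, pass the result as the env argument of subprocess.run.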

Common Debugging Scenarios

Segmentation Faults

1. Get a backtrace

gdb --args ./program
(gdb) run
# When it crashes:
(gdb) backtrace

2. Enable core dumps

ulimit -c unlimited
./program
# After crash:
gdb ./program core
(gdb) backtrace

3. Use AddressSanitizer

Rebuild with ASan and re-run. It often pinpoints the exact error.
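
When the crash happens inside a Python process, for example in a script that calls into PyArrow, the standard-library faulthandler module can show which Python line was executing when the fatal signal arrived. This is a sketch of the Python-side setup only; it complements the native backtrace from GDB and does not replace it.

```python
import faulthandler
import sys

# Dump the Python traceback of every thread to the original stderr
# if the process receives a fatal signal such as SIGSEGV.
faulthandler.enable(file=sys.__stderr__, all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
```

The same behavior is available without code changes by running python -X faulthandler or by setting PYTHONFAULTHANDLER=1.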

Memory Leaks

1. Use Valgrind

valgrind --leak-check=full ./build-debug/arrow-array-test

2. Use AddressSanitizer

export ASAN_OPTIONS=detect_leaks=1
./build-debug/arrow-array-test

Build Failures

Start from a clean build:
# C++
rm -rf build-debug
mkdir build-debug && cd build-debug
cmake ..
make -j8

# Python
cd arrow/python
rm -rf build/
python setup.py clean --all
python setup.py build_ext --inplace

Rebuild with verbose output:
# CMake
make VERBOSE=1

# Python
export PYARROW_BUILD_VERBOSE=1
python setup.py build_ext --inplace

Check that CMake can find dependencies:
cmake .. -DCMAKE_FIND_DEBUG_MODE=ON

Test Failures

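Run a single test:
# C++
./build-debug/arrow-array-test --gtest_filter=TestArray.TestBasics

# Python
pytest pyarrow/tests/test_array.py::test_basics -v

# R
devtools::test_active_file()

Debug the failing test under GDB:
gdb --args ./build-debug/arrow-array-test --gtest_filter=TestArray.TestBasics
(gdb) break array.cc:123
(gdb) run

Get more detailed output:
# C++ (GoogleTest has no verbose flag; run from the build directory so ctest prints the output of failing tests)
ctest --output-on-failure -R arrow-array-test

# Python
pytest pyarrow/tests/test_array.py -vv -s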
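
GoogleTest's --gtest_filter accepts more than a single test name: positive patterns are separated by ":", and an optional "-" introduces patterns to exclude. The sketch below assembles such a filter. The helper is hypothetical, and the test names are placeholders rather than real Arrow tests.

```python
def gtest_filter(include=("*",), exclude=()):
    """Build a --gtest_filter flag from include and exclude patterns."""
    flag = "--gtest_filter=" + ":".join(include)
    if exclude:
        # Everything after the first "-" is treated as a negative pattern.
        flag += "-" + ":".join(exclude)
    return flag


# Run every TestArray case except one placeholder slow test.
print(gtest_filter(["TestArray.*"], ["TestArray.TestSlow"]))
```

Narrowing the filter this way keeps GDB sessions short, because only the tests under investigation run before the breakpoint is hit.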

Import Errors (Python)

import sys
print(sys.path)

import pyarrow
print(pyarrow.__file__)
print(pyarrow.__version__)

# Check that the compiled extension module (which links the C++ library) loads
import pyarrow.lib
If imports fail:
# Ensure you're in the right directory
cd arrow/python

# Check if built in-place
ls -la pyarrow/*.so

# Rebuild if needed
python setup.py build_ext --inplace
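
A common cause of import errors is Python picking up a different pyarrow than the one you just built, such as an installed wheel instead of the in-place build. The sketch below reports where a module would be loaded from without letting an import failure abort the script. The locate_module helper is hypothetical and uses only importlib from the standard library. Note that resolving a submodule such as pyarrow.lib imports its parent package first.

```python
import importlib.util


def locate_module(name):
    """Return the file a module would load from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or fails to import.
        return None
    if spec is None:
        return None
    return spec.origin


for name in ("pyarrow", "pyarrow.lib"):
    print(name, "->", locate_module(name))
```

If the reported path points outside your source checkout, the in-place build is being shadowed by another installation on sys.path.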

Debugging CI Failures

When tests pass locally but fail in CI:

1. Check the CI logs

Look for the specific error message and stack trace in the CI output.

2. Reproduce the CI environment

Use Docker to reproduce the exact CI environment:
# See dev/docker-compose.yml for available images
docker-compose run ubuntu-cpp

3. Check for platform-specific issues

Test on the same platform where CI failed (Linux, macOS, Windows).

4. Review sanitizer reports

CI runs with AddressSanitizer and UndefinedBehaviorSanitizer. Check for sanitizer warnings in logs.

Logging and Error Messages

C++ Logging

Arrow uses a custom logging system:
#include <arrow/util/logging.h>

ARROW_LOG(INFO) << "Processing array with length: " << array->length();
ARROW_LOG(WARNING) << "Unexpected null values";
ARROW_LOG(ERROR) << "Failed to allocate memory";
Control log level:
export ARROW_LOG_LEVEL=DEBUG
./program

Python Logging

import logging
import pyarrow as pa

logging.basicConfig(level=logging.DEBUG)
pa.set_cpu_count(4)  # Will log CPU count changes

Performance Debugging

See the Benchmarking Guide for:
  • Running performance benchmarks
  • Comparing performance across versions
  • Identifying performance regressions
Additional profiling tools:
  • perf (Linux): CPU profiling
  • Instruments (macOS): System-wide profiling
  • py-spy (Python): Python profiling without code changes
  • valgrind --tool=callgrind: Call graph profiling

Resources

  • GDB Documentation: official GDB documentation
  • LLDB Tutorial: getting started with LLDB
  • AddressSanitizer: AddressSanitizer documentation
  • Valgrind Manual: Valgrind quick start guide