ONNX Runtime GPU: Why Build It Yourself?

Before diving into the actual build, let’s clarify a few concepts.

What is onnxruntime_gpu?

ONNX Runtime (ORT) is an optimized AI inference engine. After converting a model built with PyTorch or TensorFlow to the .onnx format, ORT helps you run it as fast as possible on a variety of hardware. The onnxruntime_gpu variant leverages NVIDIA's CUDA and cuDNN libraries to squeeze maximum performance out of the GPU.

Why Is It Commonly Used for Generative AI Models (Images/Video)?

Models like Stable Diffusion or Whisper are massive and computationally intensive. ORT offers several optimizations:

  • Graph Optimization: Removes redundant operations and fuses multiple operators into a single one.
  • Memory Management: Handles GPU memory allocation efficiently so even large models run smoothly.
  • Hardware Acceleration: When paired with accelerators such as TensorRT, it can deliver several‑fold speedups over pure PyTorch.
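
To make operator fusion concrete, here is a toy sketch (my own illustration, not ORT's actual implementation): it collapses adjacent MatMul + Add nodes into a single Gemm node, the way a graph optimizer might.

```python
def fuse(ops):
    """Toy operator fusion: collapse each adjacent MatMul + Add pair
    into a single Gemm node, leaving all other ops untouched."""
    out, i = [], 0
    while i < len(ops):
        if ops[i] == "MatMul" and i + 1 < len(ops) and ops[i + 1] == "Add":
            out.append("Gemm")  # one fused kernel instead of two launches
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out

print(fuse(["MatMul", "Add", "Relu"]))  # → ['Gemm', 'Relu']
```

Fewer nodes means fewer kernel launches and intermediate buffers, which is where much of ORT's speedup comes from.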

Why Do ARM/aarch64 Users Need to Build from Source?

Running pip install onnxruntime-gpu works fine on most platforms, but aarch64 is a different story.

  1. No Pre‑built Binaries: The wheels on PyPI are primarily built for x86_64.
  2. Cutting‑Edge Architecture Support: Very new stacks like CUDA 13.0 or Compute Capability 12.1 (Blackwell) often aren’t covered by official releases yet.
  3. Tailored Optimizations: If you want to fine‑tune the build for a specific machine (e.g., DGX Spark), compiling yourself is the only way.
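
Before reaching for the compiler, a quick stdlib check can confirm you are on a platform that actually needs a source build. This is a heuristic of mine, not an official compatibility matrix:

```python
import platform

def needs_source_build(machine: str = platform.machine()) -> bool:
    """Heuristic: official onnxruntime-gpu wheels target x86_64,
    so other architectures (e.g. aarch64) typically need a source build."""
    return machine not in ("x86_64", "AMD64")

print(needs_source_build("aarch64"))  # → True
print(needs_source_build("x86_64"))   # → False
```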

Building ONNX Runtime GPU on DGX‑Spark (aarch64)

Below is a step‑by‑step guide based on my own experience.

1. Environment Check

First, verify the specs of your system. I used a DGX‑Spark equipped with a Grace Blackwell GPU.

Item          Details
OS/Arch       Linux aarch64
GPU           NVIDIA GB10 (Blackwell)
CUDA          13.0 (V13.0.88)
Python        3.12.3
Compute Cap   12.1

# Check GPU info
nvidia-smi
# Check CUDA version
nvcc --version
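
If you want to capture the CUDA version programmatically (say, for build logs), the output of nvcc --version can be parsed with a small helper. The sample string below mirrors what my machine printed; yours may differ:

```python
import re

# Sample output from `nvcc --version` (trimmed to the relevant line)
sample = """nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 13.0, V13.0.88"""

def parse_cuda_version(text: str) -> str:
    """Extract the full compiler version (e.g. '13.0.88') from nvcc output."""
    m = re.search(r"release (\d+\.\d+), V([\d.]+)", text)
    return m.group(2) if m else "unknown"

print(parse_cuda_version(sample))  # → 13.0.88
```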

2. Fill the Gaps: Install cuDNN

Because cuDNN wasn’t present on my system, I installed it as a pip package and then pointed the build at that location.

# Install cuDNN for CUDA 13
pip install nvidia-cudnn-cu13

# Locate the installation path (using a small Python snippet)
python3 -c "import site, os; print(os.path.join(site.getsitepackages()[0], 'nvidia/cudnn'))"

Next, export the necessary environment variables so the build can locate cuDNN.

export CUDA_HOME=/usr/local/cuda
export CUDNN_HOME=/home/jesse/onnxruntime/venv/lib/python3.12/site-packages/nvidia/cudnn  # adjust to the path printed above
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDNN_HOME/lib:$LD_LIBRARY_PATH
export CUDACXX=$CUDA_HOME/bin/nvcc
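
A missing or mistyped variable here tends to surface much later as a cryptic CMake error, so a tiny sanity check before kicking off the build can save a long wait. This is my own helper, checking the two variables the build flags consume:

```python
import os

def check_env(vars_=("CUDA_HOME", "CUDNN_HOME")):
    """Return the names of required build variables that are unset
    or point at directories that don't exist."""
    return [v for v in vars_
            if not os.environ.get(v) or not os.path.isdir(os.environ[v])]

missing = check_env()
if missing:
    print("Fix these before building:", missing)
```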

3. Build ONNX Runtime from Source

Now comes the core step. Run the build.sh script, specifying the Blackwell architecture (121) and the custom CUDA paths.

./build.sh \
  --config Release \
  --update --build \
  --parallel \
  --build_wheel \
  --use_cuda \
  --cuda_home $CUDA_HOME \
  --cudnn_home $CUDNN_HOME \
  --skip_tests \
  --cmake_generator Ninja \
  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=121 \
  --cmake_extra_defines CMAKE_CUDA_FLAGS="-I/usr/local/cuda/include/cccl" \
  --cmake_extra_defines CMAKE_CXX_FLAGS="-I/usr/local/cuda/include/cccl"

Tip: Adding CMAKE_CUDA_FLAGS helps avoid include‑path errors related to CCCL (the CUDA C++ Core Libraries).
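
The 121 passed to CMAKE_CUDA_ARCHITECTURES is simply the compute capability with the dot dropped. If you are adapting this build for another GPU, a one-liner makes the mapping explicit (a simple sketch; suffixed capabilities like "9.0a" are not handled):

```python
def sm_arch(compute_capability: str) -> str:
    """Map a compute capability like '12.1' to the integer form that
    CMAKE_CUDA_ARCHITECTURES expects (e.g. '121')."""
    major, minor = compute_capability.split(".")
    return f"{major}{minor}"

print(sm_arch("12.1"))  # → 121
```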

4. Verify the Output and Install

When the build finishes, a wheel file for aarch64 appears under build/Linux/Release/dist/.

# List the generated wheel
ls -lh build/Linux/Release/dist/onnxruntime_gpu-*.whl

# Install it
pip install build/Linux/Release/dist/onnxruntime_gpu-1.25.0-cp312-cp312-linux_aarch64.whl
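
If you rebuild often, the exact wheel filename changes with each version bump. A small helper (newest_wheel is my own name, not part of ORT) can pick the latest wheel from the default dist directory:

```python
from pathlib import Path

def newest_wheel(dist_dir: str = "build/Linux/Release/dist"):
    """Return the most recently built onnxruntime_gpu wheel in dist_dir,
    or None if no wheel has been produced yet."""
    wheels = sorted(Path(dist_dir).glob("onnxruntime_gpu-*.whl"),
                    key=lambda p: p.stat().st_mtime)
    return wheels[-1] if wheels else None

print(newest_wheel())
```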

5. Final Validation

Check that the library correctly detects the GPU.

import onnxruntime as ort
print("ORT version:", ort.__version__)
print("Available providers:", ort.get_available_providers())

Result: If you see ['CUDAExecutionProvider', 'CPUExecutionProvider'], you’re good to go! 🎉


Closing Thoughts

The wheel you built can later be copied into a Docker image with a simple COPY command, making it reusable across deployments. Building for aarch64 and the latest CUDA stacks can be fiddly, but once you have a solid build, you’ll get excellent performance when serving generative AI models.

I hope this guide lights the way for fellow DGX‑Spark users!