# ONNX Runtime GPU: Why Build It Yourself?

Before diving into the actual build, let's clarify a few concepts.

## **What is onnxruntime_gpu?** {#sec-9ff1a09e9e89}

**ONNX Runtime (ORT)** is an optimized AI inference engine. After converting a model built with PyTorch or TensorFlow to the `.onnx` format, ORT helps you run it as fast as possible on a variety of hardware. The `onnxruntime_gpu` variant leverages NVIDIA's **CUDA** and **cuDNN** libraries to squeeze maximum performance out of the GPU.

![image of onnxruntime](/media/whitedec/blog_img/35245efd929e4132a695bf5bd2ad1d45.webp)

## Why Is It Commonly Used for Generative AI Models (Images/Video)? {#sec-5c0018e3e915}

Models like Stable Diffusion or Whisper are massive and computationally intensive. ORT offers several optimizations:

- **Graph Optimization:** Removes redundant operations and fuses multiple operators into a single one.
- **Memory Management:** Handles GPU memory allocation efficiently so even large models run smoothly.
- **Hardware Acceleration:** When paired with accelerators such as TensorRT, it can deliver several‑fold speedups over pure PyTorch.

## Why Do ARM/aarch64 Users Need to Build from Source? {#sec-665d2400b756}

Running `pip install onnxruntime-gpu` works fine on most platforms, but **aarch64** is a different story.

1. **No Pre‑built Binaries:** The wheels on PyPI are primarily built for x86_64.
2. **Cutting‑Edge Architecture Support:** Very new stacks like **CUDA 13.0** or **Compute Capability 12.1 (Blackwell)** often aren't covered by official releases yet.
3. **Tailored Optimizations:** If you want to fine‑tune the build for a specific machine (e.g., a DGX Spark), compiling yourself is the only way.

---

## Building ONNX Runtime GPU on DGX‑Spark (aarch64) {#sec-556ead64bd44}

Below is a step‑by‑step guide based on my own experience.

### 1. Environment Check {#sec-dbf99788fe3f}

First, verify the specs of your system. I used a DGX‑Spark equipped with a Grace Blackwell GPU.
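The host-side entries (OS, architecture, Python version) can also be read off with a short standard-library snippet; the GPU, CUDA, and compute-capability details still come from NVIDIA's own tools.

```python
import platform
import sys

# Host-side environment details. GPU/CUDA specifics require
# NVIDIA tooling (nvidia-smi, nvcc) and are not covered here.
print("OS/Arch:", platform.system(), platform.machine())  # e.g. Linux aarch64
print("Python :", platform.python_version())
print("Interp :", sys.executable)
```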
| Item | Details |
| --- | --- |
| **OS/Arch** | Linux aarch64 |
| **GPU** | NVIDIA GB10 (Blackwell) |
| **CUDA** | 13.0 (V13.0.88) |
| **Python** | 3.12.3 |
| **Compute Cap** | 12.1 |

```bash
# Check GPU info
nvidia-smi

# Check CUDA version
nvcc --version
```

### 2. Fill the Gaps: Install cuDNN {#sec-6aab9c2c470c}

Because cuDNN wasn't present on my system, I installed the Python‑package version first and then pointed the build scripts to it.

```bash
# Install cuDNN for CUDA 13
pip install nvidia-cudnn-cu13

# Locate the installation path (using a small Python snippet)
python3 -c "import site, os; print(os.path.join(site.getsitepackages()[0], 'nvidia/cudnn'))"
```

Next, export the necessary environment variables so the build can locate CUDA and cuDNN.

```bash
export CUDA_HOME=/usr/local/cuda
export CUDNN_HOME=/home/jesse/onnxruntime/venv/lib/python3.12/site-packages/nvidia/cudnn
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDNN_HOME/lib:$LD_LIBRARY_PATH
export CUDACXX=$CUDA_HOME/bin/nvcc
```

### 3. Build ONNX Runtime from Source {#sec-cad4f78d6b48}

Now comes the core step. Run the `build.sh` script, specifying the Blackwell architecture (`121`) and the custom CUDA paths.

```bash
./build.sh \
  --config Release \
  --update --build \
  --parallel \
  --build_wheel \
  --use_cuda \
  --cuda_home $CUDA_HOME \
  --cudnn_home $CUDNN_HOME \
  --skip_tests \
  --cmake_generator Ninja \
  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=121 \
  --cmake_extra_defines CMAKE_CUDA_FLAGS="-I/usr/local/cuda/include/cccl" \
  --cmake_extra_defines CMAKE_CXX_FLAGS="-I/usr/local/cuda/include/cccl"
```

> **Tip:** Adding `CMAKE_CUDA_FLAGS` helps avoid include‑path errors related to `CCCL`.

### 4. Verify the Output and Install {#sec-410f01ad96da}

When the build finishes, a wheel file for `aarch64` appears under `build/Linux/Release/dist/`.
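You can predict the tag suffix the wheel should carry before looking for it. This standard-library sketch derives it from the running interpreter (a simplification: real wheel tags come from the packaging machinery, but it matches the common CPython-on-Linux case):

```python
import sys
import sysconfig

# Interpreter tag, e.g. "cp312" for Python 3.12
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"

# Platform tag, e.g. "linux_aarch64" on an aarch64 Linux host
plat_tag = sysconfig.get_platform().replace("-", "_").replace(".", "_")

print(f"expected wheel suffix: {py_tag}-{py_tag}-{plat_tag}.whl")
```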
```bash
# List the generated wheel
ls -lh build/Linux/Release/dist/onnxruntime_gpu-*.whl

# Install it
pip install build/Linux/Release/dist/onnxruntime_gpu-1.25.0-cp312-cp312-linux_aarch64.whl
```

### 5. Final Validation {#sec-d3ac8cb2ba46}

Check that the library correctly detects the GPU.

```python
import onnxruntime as ort

print("ORT version:", ort.__version__)
print("Available providers:", ort.get_available_providers())
```

**Result:** If you see `['CUDAExecutionProvider', 'CPUExecutionProvider']`, you're good to go! 🎉

---

## Closing Thoughts {#sec-dfe0bb0b5f26}

The wheel you built can later be copied into a Docker image with a simple `COPY` command, making it reusable across deployments. Building for `aarch64` and the latest CUDA stacks can be fiddly, but once you have a solid build, you'll enjoy unmatched performance when serving generative AI models. I hope this guide shines a light for fellow DGX‑Spark users!
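To make the Docker note concrete, here is a minimal Dockerfile sketch. The base image tag is an assumption — pick one that matches your CUDA 13 / aarch64 runtime and provides Python 3.12 with pip:

```dockerfile
# Base image is illustrative; choose one matching your CUDA 13 / aarch64
# stack that ships (or installs) Python 3.12 and pip.
FROM nvcr.io/nvidia/cuda:13.0.0-runtime-ubuntu24.04

# Copy the locally built wheel into the image and install it.
COPY build/Linux/Release/dist/onnxruntime_gpu-1.25.0-cp312-cp312-linux_aarch64.whl /tmp/
RUN pip install /tmp/onnxruntime_gpu-1.25.0-cp312-cp312-linux_aarch64.whl \
    && rm /tmp/onnxruntime_gpu-*.whl
```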