Revision Ver-2

Porting Ditto TalkingHead (DGX Spark / ARM64) to TensorRT – A Detailed Log

Purpose

To run the ditto-talkinghead project on a DGX Spark (ARM64 / aarch64) system, I converted the existing ONNX models into TensorRT engines compatible with this environment and verified inference.

Because warp_network.onnx depends on a custom GridSample3D plugin, the core challenge was loading a TensorRT custom plugin (.so) built for the target architecture.

[Image: the porting process on DGX Spark]

Why This Was Necessary

The Ditto checkpoint includes libgrid_sample_3d_plugin.so, but the binary is compiled for x86‑64.

My setup:

DGX Spark
ARM64 (aarch64)
TensorRT 10.14.1
CUDA 13.1

Since an x86 plugin cannot be loaded on an ARM TensorRT runtime, the parser failed on warp_network.onnx and the TRT engine could not be created.

Symptoms (Key Failure Log)

During TRT conversion, warp_network produced the following errors:

Unable to load library: libgrid_sample_3d_plugin.so
Plugin not found ... GridSample3D
Fail parsing warp_network.onnx
Final error: Network must have at least one output

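The root cause was the first line: the GridSample3D plugin library failed to load. The later errors cascade from it. Without the plugin, the parser cannot resolve the GridSample3D node, parsing of warp_network.onnx aborts, and the resulting empty network has no outputs.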

Root‑Cause Analysis

1) Verify plugin file existence

The file was indeed present at:

./checkpoints/ditto_onnx/libgrid_sample_3d_plugin.so

So it wasn’t a simple path issue.

2) Check with `ldd`

Running ldd ./checkpoints/ditto_onnx/libgrid_sample_3d_plugin.so returned:

not a dynamic executable

ldd prints this message when the host's dynamic loader cannot interpret the file, which hinted at an architecture mismatch.

3) Confirm architecture with `file` / `readelf`

The decisive evidence:

file output: ELF 64-bit LSB shared object, x86-64
readelf -h output: Machine: Advanced Micro Devices X86-64

The supplied .so is an x86‑64 binary, which cannot be loaded on ARM64.
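The same check can be scripted when file or readelf is not available. Below is a minimal sketch that reads the e_machine field straight from the ELF header. The function name elf_machine and the two-entry lookup table are my own for illustration; the numeric codes (62 for x86-64, 183 for AArch64) come from the ELF specification.

```python
import struct

# e_machine codes defined by the ELF specification (only the two relevant here)
EM_NAMES = {62: "x86-64", 183: "aarch64"}


def elf_machine(path):
    """Return the target architecture recorded in an ELF file's header."""
    with open(path, "rb") as f:
        header = f.read(20)
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError(f"{path} is not an ELF file")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_ident and e_type
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return EM_NAMES.get(machine, f"unknown ({machine})")


# Usage (path taken from step 1 above):
# elf_machine("./checkpoints/ditto_onnx/libgrid_sample_3d_plugin.so")
```

For the bundled plugin this should agree with the readelf result above, and comparing the return value against platform.machine() gives a quick mismatch test.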

Solution (Key Insight)

Conclusion

The GridSample3D TensorRT plugin must be rebuilt for ARM64.

Step‑by‑Step Procedure

1) Obtain plugin source

The ditto-talkinghead repository only provided the binary, so I fetched the source from a separate repo:

grid-sample3d-trt-plugin

Key source files:

grid_sample_3d_plugin.cpp
grid_sample_3d_plugin.h
grid_sample_3d.cu
grid_sample_3d.cuh

2) Verify TensorRT / CUDA environment

The environment details:

TensorRT Python: 10.14.1.48
TensorRT library: /usr/lib/aarch64-linux-gnu/libnvinfer.so
CUDA: 13.1
GPU capability: (12, 1) → CUDA arch 121
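These values can be collected in one pass. The sketch below is one way to do it, assuming the tensorrt and torch Python packages are the ones in use; describe_environment is a hypothetical helper name, and it reports missing pieces instead of raising so it can run on a half-configured machine.

```python
import ctypes.util
import platform


def describe_environment():
    """Collect the facts the plugin build depends on."""
    info = {"machine": platform.machine()}  # "aarch64" is expected on DGX Spark

    try:
        import tensorrt as trt
        info["tensorrt_python"] = trt.__version__
    except ImportError:
        info["tensorrt_python"] = "not installed"

    # Name of the nvinfer shared library as the system resolves it, or None
    info["nvinfer_library"] = ctypes.util.find_library("nvinfer")

    try:
        import torch
        if torch.cuda.is_available():
            info["gpu_capability"] = torch.cuda.get_device_capability()
        else:
            info["gpu_capability"] = "no CUDA device visible"
    except ImportError:
        info["gpu_capability"] = "torch not installed"
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Note that find_library returns only the library name, not the full path listed above, and it can return None even when the library is installed in a directory the linker cache does not cover.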

3) Adjust CMake build settings

The original CMake file hard-coded a list of older GPU architectures and did not locate the CUDA and TensorRT paths used on this system, which caused several issues:

Forced compute_70 (unsupported by the current nvcc)
Missing include path for cuda_fp16.h
No explicit TensorRT lib directory

Fixes applied

Removed hard‑coded CUDA_ARCHITECTURES
Added TensorRT include and lib paths
Specified CUDA include path (/usr/local/cuda/targets/sbsa-linux/include)
Disabled optional test subdirectory build

4) Build the ARM64 plugin

cd /workspace/grid-sample3d-trt-plugin
rm -rf build
mkdir build && cd build
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DTensorRT_ROOT=/usr \
  -DTensorRT_INCLUDE_DIR=/usr/include/aarch64-linux-gnu \
  -DTensorRT_LIB_DIR=/usr/lib/aarch64-linux-gnu \
  -DCMAKE_CUDA_ARCHITECTURES=121

cmake --build . -j"$(nproc)"

The build produced:

build/libgrid_sample_3d_plugin.so

Confirm the binary is ARM64:

file build/libgrid_sample_3d_plugin.so

Expected output example:

ELF 64-bit LSB shared object, ARM aarch64, …
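The configure command above passes the architecture as a literal 121. As a rough sketch, the same command line can be assembled from a capability tuple so the value follows the GPU actually installed. The helper cmake_configure_command is hypothetical; its default paths are simply the ones used in the command above and may differ on other systems.

```python
import shlex


def cmake_configure_command(capability,
                            trt_include="/usr/include/aarch64-linux-gnu",
                            trt_lib="/usr/lib/aarch64-linux-gnu",
                            source_dir=".."):
    """Build the cmake configure command for a (major, minor) GPU capability."""
    major, minor = capability
    args = [
        "cmake", source_dir,
        "-DCMAKE_BUILD_TYPE=Release",
        "-DTensorRT_ROOT=/usr",
        f"-DTensorRT_INCLUDE_DIR={trt_include}",
        f"-DTensorRT_LIB_DIR={trt_lib}",
        # (12, 1) becomes 121, (9, 0) becomes 90
        f"-DCMAKE_CUDA_ARCHITECTURES={major}{minor}",
    ]
    return shlex.join(args)


print(cmake_configure_command((12, 1)))
```

The capability tuple can come from torch.cuda.get_device_capability(), so the architecture never needs to be typed by hand.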

5) Replace the checkpoint’s x86 plugin

cp /workspace/grid-sample3d-trt-plugin/build/libgrid_sample_3d_plugin.so \
   /workspace/ditto-talkinghead/checkpoints/ditto_onnx/libgrid_sample_3d_plugin.so

6) Verify TensorRT plugin loading

Loaded the rebuilt plugin directly through the TensorRT Python plugin registry and confirmed it worked.
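A check along these lines can be sketched as follows. This is not the exact script I ran; it assumes the library registers its creator as a side effect of being loaded, and that the creator is named GridSample3D as in the parser error earlier. The helper names are hypothetical.

```python
import ctypes


def find_creator(creators, name):
    """Return the first plugin creator whose name matches, or None."""
    for creator in creators:
        if creator.name == name:
            return creator
    return None


def load_plugin_and_find(so_path, plugin_name="GridSample3D"):
    """Load a plugin library, then look its creator up in TensorRT's registry."""
    # Imported here so find_creator stays usable without TensorRT installed;
    # importing tensorrt also loads libnvinfer before the plugin needs it.
    import tensorrt as trt

    # Assumption: loading the library is enough to register the creator.
    # A wrong-architecture .so fails right here with OSError.
    ctypes.CDLL(so_path)

    registry = trt.get_plugin_registry()
    # Prefer all_creators where the installed TensorRT provides it,
    # otherwise fall back to plugin_creator_list.
    creators = getattr(registry, "all_creators", None) or registry.plugin_creator_list
    return find_creator(creators, plugin_name)


# Usage (path from step 5):
# creator = load_plugin_and_find(
#     "/workspace/ditto-talkinghead/checkpoints/ditto_onnx/libgrid_sample_3d_plugin.so")
# print("registered" if creator is not None else "not found")
```

A None result with no OSError would mean the library loaded but registered under a different name, which is a different problem from the architecture mismatch.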

7) Rerun ONNX → TensorRT conversion

Rerunning cvt_onnx_to_trt.py now succeeded for every model, including warp_network.onnx.

Inference also succeeded.

Current Status

✅ GridSample3D custom TensorRT plugin runs on ARM64
✅ warp_network.onnx parses correctly
✅ ONNX → TensorRT engine conversion succeeds
✅ Ditto TalkingHead inference works on DGX Spark / ARM64

In short, the project is fully ported to the DGX Spark platform.

Troubleshooting Notes

1) Checkpoint‑bundled `.so` files are platform‑specific

A checkpoint may contain binaries compiled for a different ISA. If you’re on an aarch64 machine, suspect an x86 binary first.

2) Plugin rebuild may be required after TensorRT upgrades

When moving to a new major TensorRT version, API/ABI changes can break existing plugins.

3) Avoid hard‑coding CUDA architectures in `CMakeLists.txt`

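A fixed list such as 70;80;86;89 can fail in two ways: a newer nvcc may no longer support the older entries (compute_70 in this case), and none of the entries targets a newer GPU. Use the actual device capability instead, e.g., (12,1) → 121.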

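python - <<'PY'
import torch
major, minor = torch.cuda.get_device_capability()
# (12, 1) becomes 121, the value passed to CMAKE_CUDA_ARCHITECTURES
print(f"capability=({major}, {minor}) cmake_arch={major}{minor}")
PY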

Environment Summary

Platform: DGX Spark
Architecture: aarch64 (ARM64)
CUDA: 13.1
TensorRT: 10.14.1
Python: 3.12
GPU capability: 12.1 (CMake CUDA arch = 121)
