Installing MOSS-TTS: High-Performance TTS Without Fine-tuning (Leveraging NVIDIA DGX Spark)

In this post, I'll share the process of setting up MOSS-TTS, a high-performance Text-to-Speech (TTS) model, on NVIDIA's latest AI workstation, the DGX Spark (Grace-Blackwell). A key highlight is its impressive voice cloning performance, which doesn't require any fine-tuning.


System Environment Summary



This installation guide is based on the following environment.

Category        Specification                          Notes
Hardware        NVIDIA DGX Spark (Grace-Blackwell)     Low-power, low-noise AI workstation
GPU             GB10 (CUDA Capability 12.1)            Blackwell architecture
OS              Ubuntu 22.04 LTS based                 -
CUDA / Driver   CUDA 13.0                              Spark default driver environment
Python          3.10+ (using venv)                     Chose lightweight venv over Conda
VRAM Usage      Approx. 23.8 GB                        During inference standby and execution

1. Repository Clone

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS

2. Create and Activate Virtual Environment (venv)



While the GitHub guide recommends Conda, I opted for a Python virtual environment for easier Docker packaging and systemd service registration in the future.

python3 -m venv myvenv
source myvenv/bin/activate

3. Update Basic Build Tools

pip install -U pip setuptools wheel

4. Crucial Configuration: Modify pyproject.toml

To align with the CUDA 13.0 environment of DGX Spark, you must manually adjust the dependency versions. It's critical to ensure that the versions of torch and torchaudio match to prevent conflicts during installation.

  • Modifications (dependency pins in pyproject.toml):
      • "torch==2.10.0+cu130"
      • "torchaudio==2.10.0+cu130"
      • "torchcodec==0.10.0+cu130"
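Applied to pyproject.toml, the dependency block might look like the following. This is only a sketch: the surrounding entries and exact section layout depend on the repository's actual pyproject.toml, so change just the three pins and leave everything else untouched.

```toml
[project]
# Only the relevant pins are shown; keep the rest of the file as-is.
dependencies = [
    "torch==2.10.0+cu130",
    "torchaudio==2.10.0+cu130",
    "torchcodec==0.10.0+cu130",
    # ... other dependencies unchanged ...
]
```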

5. Install Dependency Packages

pip install --extra-index-url https://download.pytorch.org/whl/cu130 -e .
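If this install fails with resolver conflicts, the most common culprit is one pin carrying a different `+cu` tag than the others. A tiny throwaway check like the one below (purely illustrative, not part of the project) confirms that every pin targets the same CUDA build before you debug anything else:

```python
# Hypothetical helper: verify all pinned wheels share one CUDA local tag.
def cuda_tag(pin: str) -> str:
    """Return the local version tag (e.g. 'cu130') from a pin like 'torch==2.10.0+cu130'."""
    version = pin.split("==", 1)[1]
    return version.split("+", 1)[1] if "+" in version else "cpu"

pins = [
    "torch==2.10.0+cu130",
    "torchaudio==2.10.0+cu130",
    "torchcodec==0.10.0+cu130",
]
tags = {cuda_tag(p) for p in pins}
assert len(tags) == 1, f"mixed CUDA builds: {tags}"
print("all pins target", tags.pop())  # all pins target cu130
```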

6. Install Host FFmpeg

MOSS-TTS calls FFmpeg during inference, so install it and its development libraries beforehand to avoid runtime errors.

sudo apt update && sudo apt install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev libswscale-dev
sudo ldconfig

7. Model Download and Execution Precautions

  • Do Not Download Directly from HuggingFace: Manually downloading the model from HuggingFace and linking it via --model_path may cause errors due to variable-name mismatches with the launcher script (moss_tts_app.py).
  • Automatic Download Recommended: Running with default settings automatically downloads roughly 17 GB of model weights and 7 GB of tokenizer files to the cache path.
  • Execution Script:
python clis/moss_tts_app.py --device cuda --attn_implementation auto --host 0.0.0.0 --port 7860

Note: A "GB10 (CUDA capability 12.1)" warning may appear during execution, but it has been confirmed not to affect actual inference performance. Initial loading takes approximately 30-60 seconds.
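For reference, the automatic download above lands in the standard HuggingFace cache. The sketch below (assuming the default huggingface_hub layout; this is not MOSS-TTS-specific code) shows where to look for the roughly 24 GB of downloaded files:

```python
import os
from pathlib import Path

def hf_cache_dir() -> Path:
    """Default HuggingFace hub cache location, honoring the HF_HOME override."""
    base = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
    return base / "hub"

print(hf_cache_dir())  # e.g. /home/user/.cache/huggingface/hub
```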

8. Post-Use Review: "Fine-tuning is No Longer Necessary"

  • Cloning Performance: With just one sample of my voice, it perfectly replicates the tone and speech patterns in Korean, English, and Japanese.
  • Speed: Short sentences are generated in 7-8 seconds, while longer texts (3-4 sentences) take approximately 30 seconds.
  • Language-Specific Features: English generation is nearly flawless. Misread Kanji in Japanese can be easily corrected by using Hiragana.
  • Power Efficiency: Even during inference, it maintains a low power consumption of around 36W, and the absence of fan noise is a significant advantage of the DGX Spark.

9. Troubleshooting: Why Not Use NVIDIA's Official Image?

The nvcr.io/nvidia/pytorch:26.01-py3 image provided by NVIDIA cannot build the torchaudio and torchcodec packages that TTS operation requires. The PyTorch included in NVIDIA's images appears to be a custom build versioned for NVIDIA's own stack, which causes version mismatches with stock torchaudio and torchcodec wheels. For now, a plain venv environment is therefore the most stable option, even on Spark.


🚀 Future Plans

  • FlashAttention 2 Integration: I plan to test how much inference speed improves after implementing FlashAttention 2.
  • MOSS-VoiceGenerator: I also plan to explore the MOSS-VoiceGenerator model, which creates new virtual voices without a reference.

Related Posts

  • NVIDIA DGX Spark - The New Standard for On-Premise AI Infrastructure