# Installing MOSS-TTS: High-Performance TTS Without Fine-tuning (Leveraging NVIDIA DGX Spark)

In this post, I'll walk through setting up **MOSS-TTS**, a high-performance Text-to-Speech (TTS) model, on NVIDIA's latest AI workstation, the **DGX Spark (Grace-Blackwell)**. A key highlight is its impressive voice cloning performance, which requires no fine-tuning at all.

![moss-tts-on-dgx-spark](/media/whitedec/blog_img/e24d4416693f4aeaae267eecfa832122.webp)

## System Environment Summary {#sec-f34b9d0d948e}

This installation guide is based on the following environment.

| Category | Specification | Notes |
| --- | --- | --- |
| **Hardware** | NVIDIA DGX Spark (Grace-Blackwell) | Low-power, low-noise AI workstation |
| **GPU** | GB10 (CUDA Capability 12.1) | Blackwell architecture |
| **OS** | Ubuntu 22.04 LTS based | - |
| **CUDA / Driver** | CUDA 13.0 | Spark default driver environment |
| **Python** | 3.10+ (using venv) | Chose lightweight venv over Conda |
| **VRAM Usage** | Approx. 23.8 GB | Measured at inference standby and during execution |

---

## 1. Clone the Repository {#sec-ef4f934465f5}

```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
```

## 2. Create and Activate a Virtual Environment (venv) {#sec-919dd44853da}

While the GitHub guide recommends Conda, I opted for a plain Python virtual environment to simplify Docker packaging and `systemd` service registration later on.

```bash
python3 -m venv myvenv
source myvenv/bin/activate
```

## 3. Update the Basic Build Tools {#sec-93572d851f38}

```bash
pip install -U pip setuptools wheel
```

## 4. Crucial Configuration: Modify pyproject.toml {#sec-20dfda9d709e}

To align with the **CUDA 13.0** environment of the DGX Spark, you must manually adjust the dependency versions. It is critical that the versions of `torch` and `torchaudio` **match exactly** to prevent conflicts during installation.

* **Modifications:**
  * `"torch==2.10.0+cu130"`
  * `"torchaudio==2.10.0+cu130"`
  * `"torchcodec==0.10.0+cu130"`

## 5. Install the Dependency Packages {#sec-946338ef1888}

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu130 -e .
```

## 6. Install FFmpeg on the Host {#sec-4245e15027ca}

FFmpeg is required to avoid errors during inference, so install it along with its libraries beforehand.

```bash
sudo apt update && sudo apt install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev libswscale-dev
sudo ldconfig
```

## 7. Model Download and Execution Precautions {#sec-aa11431d729}

* **Do not download directly from Hugging Face:** Manually downloading the model and linking it via `--model_path` may cause errors due to variable name mismatches with the launcher script (`moss_tts_app.py`).
* **Automatic download recommended:** Running with the default settings automatically downloads approximately 17 GB of model weights and 7 GB of tokenizer files to the cache path.
* **Execution script:**

```bash
python clis/moss_tts_app.py --device cuda --attn_implementation auto --host 0.0.0.0 --port 7860
```

> **Note:** A `GB10 (cuda capability 12.1)` warning may appear during execution, but I confirmed it does not affect actual inference performance. Initial loading takes roughly 30-60 seconds.

## 8. Post-Use Review: "Fine-tuning Is No Longer Necessary" {#sec-cfb473ec3c5d}

* **Cloning performance:** With just one sample of my voice, it perfectly replicates my tone and speech patterns in Korean, English, and Japanese.
* **Speed:** Short sentences are generated in 7-8 seconds; longer texts (3-4 sentences) take about 30 seconds.
* **Language-specific notes:** English generation is nearly flawless. Misread kanji in Japanese are easily corrected by rewriting them in hiragana.
* **Power efficiency:** Even during inference, power draw stays around **36 W**, and the absence of fan noise is a major advantage of the DGX Spark.

---

## 9. Troubleshooting: Why Not Use NVIDIA's Official Image? {#sec-e2798483c6b5}

The `nvcr.io/nvidia/pytorch:26.01-py3` image provided by NVIDIA does not allow `torchaudio` and `torchcodec`, both required for TTS operation, to be built. The PyTorch build included in NVIDIA's images appears to be specially compiled for NVIDIA hardware, which leads to version mismatches with `torchaudio` and `torchcodec`. For now, a standard `venv` environment is the most stable option, even on the Spark.

---

## 🚀 Future Plans {#sec-620285e7d87d}

* **FlashAttention 2 integration:** I plan to test how much inference speed improves with FlashAttention 2 enabled.
* **MOSS-VoiceGenerator:** I also plan to explore the MOSS-VoiceGenerator model, which creates new virtual voices without a reference sample.

---

**Related Posts**

- [NVIDIA DGX Spark - A New Standard for On-Premise AI Infrastructure](/ko/whitedec/2025/5/12/nvidia-dgx-spark-ai-infra/)
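## Appendix: Sketch of a systemd Unit

One stated reason for choosing venv over Conda in step 2 was easier `systemd` registration, so here is a minimal sketch of what that unit file could look like. The install path (`/opt/MOSS-TTS`), venv name, and service name are assumptions for illustration; adjust them to wherever you cloned the repo and created the venv.

```ini
# /etc/systemd/system/moss-tts.service -- hypothetical paths; adjust to your setup
[Unit]
Description=MOSS-TTS web app
After=network-online.target

[Service]
# Assumes the repo is cloned to /opt/MOSS-TTS and the venv from step 2 lives inside it
WorkingDirectory=/opt/MOSS-TTS
ExecStart=/opt/MOSS-TTS/myvenv/bin/python clis/moss_tts_app.py --device cuda --attn_implementation auto --host 0.0.0.0 --port 7860
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After saving the file, `sudo systemctl daemon-reload && sudo systemctl enable --now moss-tts` would start the app at boot. Note the first start will be slow if the ~24 GB of weights and tokenizer files have not been cached yet.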