# Installing MOSS-TTS: High-Performance TTS Without Fine-tuning (Leveraging NVIDIA DGX Spark)

In this post, I'll walk through setting up **MOSS-TTS**, a high-performance Text-to-Speech (TTS) model, on NVIDIA's latest AI workstation, the **DGX Spark (Grace-Blackwell)**. A key highlight is its impressive voice cloning performance, which requires no fine-tuning at all.

![moss-tts-on-dgx-spark](/media/whitedec/blog_img/e24d4416693f4aeaae267eecfa832122.webp)

## System Environment Summary {#sec-f34b9d0d948e}

This installation guide is based on the following environment.

| Category | Specification | Notes |
| --- | --- | --- |
| **Hardware** | NVIDIA DGX Spark (Grace-Blackwell) | Low-power, low-noise AI workstation |
| **GPU** | GB10 (CUDA Capability 12.1) | Blackwell architecture |
| **OS** | Ubuntu 22.04 LTS based | - |
| **CUDA / Driver** | CUDA 13.0 | Spark default driver environment |
| **Python** | 3.10+ (using venv) | Chose lightweight venv over Conda |
| **VRAM Usage** | Approx. 23.8 GB | Measured at inference standby and during execution |

---

## 1. Clone the Repository {#sec-ef4f934465f5}

```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
```

## 2. Create and Activate a Virtual Environment (venv) {#sec-919dd44853da}

While the GitHub guide recommends Conda, I opted for a plain Python virtual environment to simplify Docker packaging and `systemd` service registration later on.

```bash
python3 -m venv myvenv
source myvenv/bin/activate
```

## 3. Update the Basic Build Tools {#sec-93572d851f38}

```bash
pip install -U pip setuptools wheel
```

## 4. Crucial Configuration: Modify pyproject.toml {#sec-20dfda9d709e}

To align with the **CUDA 13.0** environment of the DGX Spark, you must manually adjust the dependency versions. It is critical that the versions of `torch` and `torchaudio` **match exactly** to prevent conflicts during installation.

* **Modifications:**
  * `"torch==2.10.0+cu130"`
  * `"torchaudio==2.10.0+cu130"`
  * `"torchcodec==0.10.0+cu130"`

## 5. Install the Dependency Packages {#sec-946338ef1888}

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu130 -e .
```

## 6. Install FFmpeg on the Host {#sec-4245e15027ca}

FFmpeg is required to avoid errors during inference, so install it along with its libraries beforehand.

```bash
sudo apt update && sudo apt install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev libswscale-dev
sudo ldconfig
```

## 7. Model Download and Execution Precautions {#sec-aa11431d729}

* **Do not download directly from Hugging Face:** Manually downloading the model and linking it via `--model_path` may cause errors due to variable name mismatches with the launcher script (`moss_tts_app.py`).
* **Automatic download recommended:** Running with the default settings automatically downloads approximately 17 GB of model weights and 7 GB of tokenizer files to the cache path.
* **Execution script:**

```bash
python clis/moss_tts_app.py --device cuda --attn_implementation auto --host 0.0.0.0 --port 7860
```

> **Note:** A `GB10 (cuda capability 12.1)` warning may appear during execution, but I confirmed it does not affect actual inference performance. Initial loading takes roughly 30-60 seconds.

## 8. Post-Use Review: "Fine-tuning Is No Longer Necessary" {#sec-cfb473ec3c5d}

* **Cloning performance:** With just one sample of my voice, it perfectly replicates my tone and speech patterns in Korean, English, and Japanese.
* **Speed:** Short sentences are generated in 7-8 seconds; longer texts (3-4 sentences) take about 30 seconds.
* **Language-specific notes:** English generation is nearly flawless. Misread kanji in Japanese are easily corrected by rewriting them in hiragana.
* **Power efficiency:** Even during inference, power draw stays around **36 W**, and the absence of fan noise is a major advantage of the DGX Spark.

---

## 9. Troubleshooting: Why Not Use NVIDIA's Official Image? {#sec-e2798483c6b5}

The `nvcr.io/nvidia/pytorch:26.01-py3` image provided by NVIDIA does not allow `torchaudio` and `torchcodec`, both required for TTS operation, to be built. The PyTorch build included in NVIDIA's images appears to be specially compiled for NVIDIA hardware, which leads to version mismatches with `torchaudio` and `torchcodec`. For now, a standard `venv` environment is the most stable option, even on the Spark.

---

## 🚀 Future Plans {#sec-620285e7d87d}

* **FlashAttention 2 integration:** I plan to test how much inference speed improves with FlashAttention 2 enabled.
* **MOSS-VoiceGenerator:** I also plan to explore the MOSS-VoiceGenerator model, which creates new virtual voices without a reference sample.

---

**Related Posts**

- [NVIDIA DGX Spark - A New Standard for On-Premise AI Infrastructure](/ko/whitedec/2025/5/12/nvidia-dgx-spark-ai-infra/)
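## Appendix: Sketch of a systemd Unit

One stated reason for choosing venv over Conda in step 2 was easier `systemd` registration, so here is a minimal sketch of what that unit file could look like. The install path (`/opt/MOSS-TTS`), venv name, and service name are assumptions for illustration; adjust them to wherever you cloned the repo and created the venv.

```ini
# /etc/systemd/system/moss-tts.service -- hypothetical paths; adjust to your setup
[Unit]
Description=MOSS-TTS web app
After=network-online.target

[Service]
# Assumes the repo is cloned to /opt/MOSS-TTS and the venv from step 2 lives inside it
WorkingDirectory=/opt/MOSS-TTS
ExecStart=/opt/MOSS-TTS/myvenv/bin/python clis/moss_tts_app.py --device cuda --attn_implementation auto --host 0.0.0.0 --port 7860
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After saving the file, `sudo systemctl daemon-reload && sudo systemctl enable --now moss-tts` would start the app at boot. Note the first start will be slow if the ~24 GB of weights and tokenizer files have not been cached yet.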