How to Launch VoxCPM2


Docker offers the quickest path to setting up this model locally.

Follow the instructions below, then run the docker-compose up command to launch the model.
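For anyone scripting the launch step, here is a minimal Python sketch that builds the Compose command. It assumes a Compose file already exists in the working directory; the page does not show one. The helper names compose_up_command and launch are hypothetical. The fallback to "docker compose" (the Compose v2 plugin form) is an addition for machines that lack the standalone docker-compose binary.

```python
import shutil
import subprocess


def compose_up_command(detached=False, which=shutil.which):
    """Return the argv list that brings the Compose stack up.

    `which` is injectable so the lookup can be tested without Docker installed.
    """
    if which("docker-compose"):
        # Standalone binary, as named in the instructions above.
        cmd = ["docker-compose", "up"]
    elif which("docker"):
        # Compose v2 ships as a plugin of the docker CLI.
        cmd = ["docker", "compose", "up"]
    else:
        raise RuntimeError("neither docker-compose nor docker was found on PATH")
    if detached:
        # -d runs the containers in the background.
        cmd.append("-d")
    return cmd


def launch(detached=False):
    """Run the command in the current directory; raises if it exits non-zero."""
    subprocess.run(compose_up_command(detached), check=True)
```

Calling launch() from the directory that holds the Compose file is equivalent to typing the command by hand; launch(detached=True) returns control to the shell once the containers start.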

🔗 SHA sum: e103e9b136be44ed5abcac181a04d1fb | Updated: 2026-06-23
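A published checksum is only useful if it is compared against the downloaded file. The sketch below shows one way to do that in Python. Two things are assumptions: the page does not say which file the value covers, so the path argument is a placeholder, and the algorithm is guessed from the digest length. The value above is 32 hexadecimal characters, which is the length of an MD5 digest rather than of any SHA family digest, so the mapping below treats it as MD5.

```python
import hashlib

# Hex digest length -> hashlib algorithm name (a guess based on length only).
ALGO_BY_HEX_LEN = {32: "md5", 40: "sha1", 64: "sha256"}


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads are not loaded into memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, published_hex):
    """Return True if the file's digest equals the published hex string."""
    published_hex = published_hex.strip().lower()
    algorithm = ALGO_BY_HEX_LEN.get(len(published_hex))
    if algorithm is None:
        raise ValueError(f"unrecognized digest length: {len(published_hex)}")
    return file_digest(path, algorithm) == published_hex
```

A mismatch means the file differs from the one the checksum was computed for, or that the length-based guess at the algorithm is wrong.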

CPU: multi-threading optimized for fast prompt processing
RAM: 32 GB highly recommended for 26B+ GGUF models
Disk: 150+ GB for high-context vector database storage
Graphics: TensorRT-LLM / vLLM inference engine compatible chip
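The RAM and disk figures above can be checked before launching. The following is a minimal Python sketch, not an official preflight tool: the function names are hypothetical, GB is interpreted as 10^9 bytes, and the RAM probe relies on os.sysconf, which works on Linux and other POSIX systems but not on Windows. CPU and GPU suitability are not checked.

```python
import os
import shutil

GB = 10 ** 9  # decimal gigabytes; an assumption about the units used above


def check_resources(total_ram_bytes, free_disk_bytes, min_ram_gb=32, min_disk_gb=150):
    """Return a list of warnings; the list is empty when both thresholds are met."""
    warnings = []
    if total_ram_bytes < min_ram_gb * GB:
        warnings.append(
            f"RAM: {total_ram_bytes / GB:.1f} GB found, {min_ram_gb} GB recommended"
        )
    if free_disk_bytes < min_disk_gb * GB:
        warnings.append(
            f"Disk: {free_disk_bytes / GB:.1f} GB free, {min_disk_gb} GB recommended"
        )
    return warnings


def detect_total_ram_bytes():
    """Total physical memory on POSIX systems (page size times page count)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def detect_free_disk_bytes(path="."):
    """Free space on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free
```

On a Linux host, check_resources(detect_total_ram_bytes(), detect_free_disk_bytes()) returns the warnings for the current machine. Passing the measured values in as arguments keeps the threshold logic testable without touching real hardware.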

VoxCPM2 is a next‑generation speech synthesis model designed to generate highly natural‑sounding audio across dozens of languages. It leverages a conditional parameterization approach that reduces memory footprint by up to 60 % while preserving voice fidelity. The architecture integrates a hierarchical encoder and a diffusion‑based decoder, enabling real‑time inference with latency under 150 ms on standard hardware. A built‑in speaker adaptation module allows users to personalize voice models with just a few seconds of audio, eliminating the need for extensive retraining. These capabilities are showcased in a comparative benchmark where VoxCPM2 outperforms prior models on MOS scores, word error rates, and multilingual consistency, as detailed in the table below.