Text-to-Speech AI Edits Single Words Mid-Recording: ViiTorVoice Goes …

A Chinese startup, Yunshang Qulv, released an open-source text-to-speech model called ViiTorVoice-NAR on July 1, 2026. It is available on GitHub and Hugging Face under an Apache 2.0 license. The model can replace a single word inside a finished audio recording without regenerating the audio around it: it uses the surrounding audio as context and synthesizes only the words that changed, delivering first-frame audio in under 60 milliseconds. The non-autoregressive design, built on masked discrete audio tokens, also enables voice cloning from a raw audio clip with no reference transcript, a capability demonstrated using clips of professional athletes. The release comes thirty-two days before the EU AI Act's mandatory audio-labeling deadline of August 2, 2026, as a freely downloadable open-source package with no built-in technical consent mechanism.
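
For readers curious how masked-token editing can leave the rest of a recording untouched, here is a minimal toy sketch in Python. It is not ViiTorVoice-NAR's code or API: the function names, the 1024-entry codebook, the four decoding steps, and the random "predictor" are all assumptions made for illustration. It shows only the general idea of masking the edited span, keeping the surrounding tokens fixed as context, and filling the masked positions over a few parallel passes.

```python
import random

MASK = -1  # illustrative sentinel for a masked token position


def toy_predictor(tokens, rng):
    """Hypothetical stand-in for a trained model.

    Returns a (token, confidence) guess for every masked position.
    A real model would condition on the unmasked audio tokens and
    the new text; this one just draws random values.
    """
    return {
        i: (rng.randrange(1024), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def edit_span(tokens, start, end, new_len, steps=4, seed=0):
    """Replace tokens[start:end] with new_len newly decoded tokens.

    Tokens outside the span are never modified; they only serve as context.
    """
    rng = random.Random(seed)
    work = tokens[:start] + [MASK] * new_len + tokens[end:]
    for step in range(steps):
        guesses = toy_predictor(work, rng)
        if not guesses:
            break
        # Commit an even share of the remaining masked positions per pass
        # (ceiling division), most confident guesses first.
        keep = -(-len(guesses) // (steps - step))
        ranked = sorted(guesses.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in ranked[:keep]:
            work[i] = tok
    return work


original = list(range(100, 120))        # pretend these are audio tokens
edited = edit_span(original, 8, 11, 5)  # swap a 3-token word for a 5-token one
```

In this sketch the tokens before position 8 and after position 11 come out identical to the input, which mirrors the claim that only the changed words are synthesized. Because all masked positions are predicted in parallel rather than one after another, the number of passes stays fixed regardless of span length, which is one common reason non-autoregressive designs are associated with low latency.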