Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion A…

Interfaze, a young YC startup, has open-sourced a new speech recognition model called diffusion-gemma-asr-small, described as the first open-source multilingual diffusion ASR model. The model transcribes audio with a diffusion decoder rather than an autoregressive one, using DiffusionGemma’s parallel denoising decoder.

One adapter handles six languages, with the research team training only about 42M parameters on top of a frozen 26B backbone, roughly 0.16% of the model’s weights. The model uses uniform, random-token diffusion rather than the absorbing <mask> scheme. Transcription cost scales with the number of denoising steps, not with transcript length. On LibriSpeech it leads its diffusion peers with 6.6% WER versus Whisfusion’s 8.3%, but it trails autoregressive Whisper. The adapter ships under Apache-2.0; DiffusionGemma (Gemma terms) and whisper-small (MIT) are loaded separately.

The decoder belongs to DiffusionGemma, Google’s 26B mixture-of-experts model, which activates 4B parameters using 128 experts with top-8 routing. Interfaze added audio to this text-only model by using a frozen whisper-small encoder as a feature extractor, turning 30 seconds of audio into 1500 frames of 768-dimensional acoustic features. A small trainable projector compresses these frames using conv layers that subsample 8× plus a linear map, output