Mistral Releases A New Open-source Model For Speech Generation

6 days ago

French AI company Mistral released a new open-source text-to-speech model on Thursday that can be used by voice AI assistants or in enterprise use cases like customer support. The model, which lets enterprises build voice agents for sales and customer engagement, puts Mistral in direct competition with the likes of ElevenLabs, Deepgram, and OpenAI.

The new model, called Voxtral TTS, supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

“Our customers have been asking for a speech model. So we built a small-sized speech model that can fit on a smartwatch, a smartphone, a laptop, or other edge devices. The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance,” Pierre Stock, VP of science operations at Mistral AI, told TechCrunch during a phone interview.

Mistral said the new model can adapt to a custom voice with a sample of less than five seconds, and can also capture characteristics like subtle accents, inflections, intonations, and irregularities in the flow of speech. The model, based on Ministral 3B, can switch between languages easily without losing the characteristics of the voice, which is useful for use cases like dubbing or real-time translation. Stock said the company wanted the model to sound human and not robotic.

The model has been built for real-time performance, according to the company. It has a time-to-first-audio (TTFA), a measure of when the model starts ‘speaking’ after receiving input, of 90ms for a 10-second sample of 500 characters. The model also has a real-time factor (RTF) of 6x, which means it can render a 10-second clip in about 1.6 seconds.
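
The relationship between the quoted real-time factor and render time is simple division, and a short sketch makes the arithmetic explicit. The function names below are hypothetical and are not part of any Mistral API; the sketch assumes, as the article does, that "6x" means audio is generated six times faster than it plays back. Some tools report the inverse ratio instead.

```python
def render_seconds(clip_seconds: float, speedup: float) -> float:
    """Time needed to synthesize a clip when generation runs
    `speedup` times faster than playback (hypothetical helper)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return clip_seconds / speedup


def implied_speedup(clip_seconds: float, render_time: float) -> float:
    """Speedup implied by a measured render time (hypothetical helper)."""
    if render_time <= 0:
        raise ValueError("render_time must be positive")
    return clip_seconds / render_time


if __name__ == "__main__":
    clip = 10.0  # seconds of audio, the clip length quoted in the article
    # An exact 6x speedup gives 10 / 6 seconds of render time.
    print(f"render time at 6x: {render_seconds(clip, 6.0):.2f} s")
    # The article's "about 1.6 seconds" implies a speedup slightly above 6x.
    print(f"speedup implied by 1.6 s: {implied_speedup(clip, 1.6):.2f}x")
```

At exactly 6x, a 10-second clip takes roughly 1.67 seconds to render, so the article's "about 1.6 seconds" is consistent with a factor of a little over 6x.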

Earlier this year, Mistral launched a pair of transcription models, one for large batch processing and the other for real-time use cases with low latency. With the new speech model, the company is likely aiming to provide a full suite of voice products to enterprises.

“We plan to have an end-to-end platform that can handle multimodal streams of input, including audio, text, and image and output as well. The main benefit of that is you get way more information with an end-to-end agentic system that supports audio as an input or output,” Stock said.

Mistral’s positioning is that its open-source approach and support for customization will help enterprises adopt its voice models over competitors’ offerings, as they can tune the models the way they want.