Gemini 3.1 Flash TTS Launch and Availability

Google has introduced Gemini 3.1 Flash TTS, a new text-to-speech model positioned as its most expressive and controllable speech model to date. The release is available in preview and can be accessed through the Gemini API, Google AI Studio, Vertex AI, and Google Vids for Workspace users.

This rollout places the model directly into Google’s existing AI ecosystem, giving developers and Workspace users multiple ways to test and build with it.

Gemini 3.1 Flash TTS Audio Control Features

More Than 200 Audio Tags for Fine-Grained Voice Control

One of the biggest additions in Gemini 3.1 Flash TTS is support for more than 200 audio tags. These tags can be embedded directly into text input, allowing developers to shape how generated speech sounds with much more precision.

The tags can control:

  • vocal style
  • pacing
  • accent
  • emotional expression

Google describes this as an “authorial” approach to audio generation, which points to a more deliberate and nuanced way to direct speech output.

Emotion and Delivery Cues Built Into Text Input

The available tags include emotional cues such as “determination” and “curiosity,” along with delivery instructions like “whispers” and “laughs.” That combination gives developers a way to guide not just what is spoken, but how it is performed.

And that matters. It turns plain text input into something closer to a performance script, which opens up more expressive possibilities for voice-driven experiences.
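To make that concrete, here is a minimal sketch of what a "performance script" built from tagged text might look like. The bracketed tag syntax and the helper function are assumptions for illustration; the announcement names tags like "whispers," "curiosity," and "determination" but does not document the exact embedding format, so consult the Gemini API docs before relying on this shape.

```python
def tag(text: str, *tags: str) -> str:
    """Prefix a line of script text with bracketed audio tags.

    The [tag] bracket syntax is an assumption for illustration; the
    model's actual tag format may differ.
    """
    return "".join(f"[{t}]" for t in tags) + " " + text


# Build a short two-line performance script with emotion and delivery cues.
script = "\n".join([
    tag("I can't believe we actually found it.", "whispers", "curiosity"),
    tag("This changes everything.", "determination"),
])
```

The resulting string would then be sent as ordinary text input, with the tags steering how each line is performed.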

Language Support and Voice Options in Gemini 3.1 Flash TTS

Support for More Than 70 Languages

Gemini 3.1 Flash TTS supports over 70 languages. The listed examples include Hindi, Japanese, and German, showing that the model is built for broad multilingual use rather than a narrow English-only setup.

30 Prebuilt Voices as Starting Points

The model includes 30 prebuilt voices that can be used as starting points. That gives developers a ready-made base to work from while still taking advantage of the model’s deeper control features.
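The announcement does not include a request example, but based on the shape of the existing Gemini TTS API, selecting one of the prebuilt voices would look roughly like the config fragment below. The voice name "Kore" comes from Google's current prebuilt-voice list and is an assumption for this preview model.

```python
# Sketch of the speech-config portion of a generateContent request body,
# following the existing Gemini TTS API's shape. The voice name is an
# assumption; the preview model's actual voice list may differ.
speech_config = {
    "voiceConfig": {
        "prebuiltVoiceConfig": {
            "voiceName": "Kore"  # one of the prebuilt voices (assumed name)
        }
    }
}

generation_config = {
    "responseModalities": ["AUDIO"],  # request audio output
    "speechConfig": speech_config,
}
```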

Native Multi-Speaker Dialogue for Conversational Audio

Gemini 3.1 Flash TTS also supports multi-speaker dialogue natively. Instead of requiring separate API calls for each voice, the model can render an entire conversation in one pass while maintaining natural flow between speakers.

That native support is especially relevant for use cases such as:

  • podcasts
  • dramatic scripts
  • assistant interfaces

For teams building conversational experiences, this removes friction from voice orchestration and makes dialogue generation feel more natural.
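As a sketch of what that single-request dialogue setup could look like, the fragment below follows the multi-speaker config shape of the existing Gemini TTS API. The speaker labels in the prompt must match the names in the config; the voice names are assumptions.

```python
# Sketch of a multi-speaker request, following the existing Gemini TTS
# API's multiSpeakerVoiceConfig shape. Voice names are assumptions.
prompt = (
    "TTS the following conversation:\n"
    "Host: Welcome back to the show.\n"
    "Guest: Great to be here."
)

speech_config = {
    "multiSpeakerVoiceConfig": {
        "speakerVoiceConfigs": [
            {"speaker": "Host",   # label matches "Host:" lines in the prompt
             "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Kore"}}},
            {"speaker": "Guest",  # label matches "Guest:" lines in the prompt
             "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Puck"}}},
        ]
    }
}
```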

Gemini 3.1 Flash TTS Leaderboard Performance

Google AI Studio reported that Gemini 3.1 Flash TTS reached an Elo score of 1,211 on the Artificial Analysis TTS leaderboard.

Artificial Analysis also noted that the model ranked second on its Speech Arena Leaderboard, placing it ahead of ElevenLabs’ Eleven v3.x. That result gives the launch an added layer of credibility, especially for developers comparing current text-to-speech options by performance and quality.

SynthID Watermarking and AI Audio Identification

Built-In Watermarking for AI-Generated Audio

All audio generated by Gemini 3.1 Flash TTS is watermarked with SynthID, Google’s imperceptible watermarking technology. The purpose is to identify AI-generated content and help reduce misinformation risks.

Watermarking Without Audio Quality Degradation

According to Google, the watermark is embedded without degrading audio quality. That detail is important because it positions watermarking as a default safeguard rather than a tradeoff that weakens the listening experience.

Gemini 3.1 Flash TTS API Access and Model Limits

Developers can access the model using the gemini-3.1-flash-tts-preview model ID in the Gemini API.
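A minimal sketch of how a call might be wired up, assuming the standard generateContent REST endpoint and the raw 16-bit, 24 kHz mono PCM output that existing Gemini TTS models return (the output format for this preview model is an assumption):

```python
import base64
import io
import wave

MODEL_ID = "gemini-3.1-flash-tts-preview"  # model ID from the announcement
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/{MODEL_ID}:generateContent"
)


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container.

    Existing Gemini TTS models return base64-encoded raw PCM at 24 kHz;
    that format is assumed here for the preview model.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# After a successful call, the audio would arrive base64-encoded, e.g.:
# pcm = base64.b64decode(
#     response["candidates"][0]["content"]["parts"][0]["inlineData"]["data"])
# open("speech.wav", "wb").write(pcm_to_wav(pcm))
```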

The stated limits are:

  • input token limit: 8,192
  • output token limit: 16,384

These specifics give developers a clearer sense of how the model fits into production workflows and prompt design.
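Those limits matter most for long scripts. One rough way to stay under the 8,192-token input limit is to split on paragraph breaks, as sketched below. The chars-per-token ratio is a heuristic assumption; an exact count would come from the API's token-counting endpoint.

```python
def chunk_for_tts(text: str, max_tokens: int = 8192,
                  chars_per_token: float = 4.0) -> list[str]:
    """Split long input into chunks under the stated input-token limit.

    Uses a rough chars-per-token heuristic (an assumption; use the API's
    token-counting endpoint for exact counts). Splits on paragraph
    breaks; a single paragraph longer than the budget is kept whole.
    """
    budget = int(max_tokens * chars_per_token)
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would exceed budget.
        if current and len(current) + len(para) + 2 > budget:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be submitted as a separate request and the resulting audio concatenated.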

Relationship to Gemini 3.1 Flash Live

The release follows the March 25 launch of Gemini 3.1 Flash Live, Google’s real-time dialogue model built for voice-first AI applications.

That timing suggests Google is continuing to expand its voice AI stack with models designed for different kinds of speech experiences, including real-time dialogue and expressive text-to-speech generation.