How to Choose STT & TTS for AI Voice Agents in 2025: A Comprehensive Guide

Learn how to choose Speech-to-Text and Text-to-Speech technologies for building a voice agent tailored to your use case. We break down key metrics and the most popular models.

Speech-to-Text (STT) and Text-to-Speech (TTS) technologies, combined with Large Language Models (LLMs), are the backbone of modern AI voice agents. These technologies work together to transform spoken language into meaningful interactions, making digital communication more natural and intuitive.

While direct Speech-to-Speech solutions are emerging, the current STT→LLM→TTS approach remains the most flexible. This method allows businesses to easily switch language models based on task complexity, providing greater adaptability.

But how do you choose the STT and TTS that best suit your purposes? In this article, we look at the criteria you should pay attention to and the characteristics of today's most popular models.


Understanding STT and TTS technologies

The voice agent interaction cycle works simply:

  1. STT captures voice input and converts it to text.
  2. An LLM generates an appropriate response.
  3. TTS converts the response back into natural-sounding speech.

Modern STT models leverage advanced deep learning architectures, particularly transformer-based models. The conversion process involves several stages: audio preprocessing, feature extraction, and sequence modeling to transform acoustic signals into precise text output.

Similarly, contemporary TTS systems employ neural networks in a two-step process. The system first converts text into spectrograms (visual representations of sound) and then transforms these spectrograms into audio waveforms that closely mimic human speech patterns.

Text-to-Speech (TTS) models have reached a mature stage, requiring only “polishing”: addressing minor bugs, reducing costs, optimizing device compatibility, and enhancing overall stability.

Speech-to-Text (STT) technologies still have room for improvement. Developers focus on critical challenges such as maintaining accuracy in difficult environments, reliably recognizing multiple simultaneous speakers, and isolating target voices amid background noise. Encouragingly, these are active development priorities for technology providers, promising significant advances in the near future.


What to look for when choosing speech technologies

When evaluating speech technologies for your business, it's important to understand that STT and TTS models serve different purposes and therefore require different evaluation approaches.

To help evaluate these criteria objectively, several industry-standard metrics can guide your decision-making process. Let's look at each criterion and its associated measurements.

Speech-to-Text (STT) selection criteria

Accuracy and recognition capabilities

The primary concern for STT models is their ability to accurately transcribe spoken language. This can be measured using Word Error Rate (WER), an industry standard metric showing the percentage of transcription errors. Good models achieve 5–10% WER, meaning the accuracy is 90–95%.
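To make WER concrete: it is the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions ("book"->"look", "two"->"too") in a six-word reference
print(f"{wer('please book a table for two', 'please look a table for too'):.0%}")  # → 33%
```

A 33% WER on this toy example corresponds to 67% accuracy. Production evaluations typically add text normalization (casing, punctuation, number formats) before scoring; libraries such as jiwer implement the same calculation with those options.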

Key considerations also include accuracy rates for different accents, background noise handling, and specialized vocabulary recognition. For instance, a customer service application needs robust handling of various accents and casual speech patterns, while healthcare applications require precise medical terminology recognition.

The ability to distinguish different speakers' voices (speaker diarization) can also be useful, particularly in scenarios where audio from multiple talkers arrives on the same channel.

Processing speed and latency

Real-time transcription capabilities are crucial for STT models, especially in interactive applications. The key performance metric here is the Real-Time Factor (RTF), which measures processing time relative to the audio's duration.

An RTF of 1 means that the processing time is exactly the same as the duration of the audio. An RTF of 0.1 indicates that the system processes audio much faster than real-time, meaning it takes only 10% of the audio duration to transcribe it. This is considered excellent for real-time applications, as it allows for rapid transcription and immediate feedback.
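Expressed in code, the calculation is trivial; the timing numbers below are purely illustrative:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1 means the engine transcribes faster than real time."""
    return processing_seconds / audio_seconds

# Illustrative numbers: 60 s of audio transcribed in 6 s
rtf = real_time_factor(processing_seconds=6.0, audio_seconds=60.0)
print(f"RTF = {rtf:.2f}")  # → RTF = 0.10
```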

Also worth looking at:

  • First Response Latency: The time before the first transcription output appears. For optimal performance in real-time STT applications, aiming for latencies under 100 ms is ideal, while values up to 200–500 ms can be acceptable depending on the context. Anything above 1 second is generally considered too high for effective interaction.
  • Speech Completion Detection: Accuracy in determining when the user has finished speaking, which affects response time and the flow of the conversation.
  • Timestamp Accuracy: Accuracy in providing metadata such as the start and end time of each utterance.
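The first of these can be measured and bucketed in a few lines. This is a sketch: `measure_first_response` accepts any iterator of partial transcripts (a real one would come from an STT SDK's streaming callback), and the labels for the in-between ranges (100–200 ms, 500 ms–1 s) are our own assumption based on the guidance above:

```python
import time

def rate_first_response_latency(ms: float) -> str:
    """Bucket a measured first-response latency into rough quality bands."""
    if ms < 100:
        return "ideal"
    if ms <= 500:
        return "acceptable"
    if ms <= 1000:
        return "borderline"   # assumption: above acceptable, below clearly too high
    return "too high"

def measure_first_response(stream) -> float:
    """Milliseconds until the first non-empty partial transcript arrives."""
    start = time.monotonic()
    for partial in stream:
        if partial:
            return (time.monotonic() - start) * 1000.0
    return float("inf")

# Simulated partial-transcript stream
latency_ms = measure_first_response(iter(["", "", "hel", "hello world"]))
print(rate_first_response_latency(latency_ms))
```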

Audio input requirements

Evaluate the model's ability to handle different audio qualities, microphone types, and background environments. Some models perform better with high-quality audio input, while others are more forgiving of variable conditions.

A critical feature for voice agents in public spaces, particularly for phone-based systems, is the ability to effectively filter and isolate target voices from background noise and other speakers.


Text-to-Speech (TTS) selection criteria

Voice quality and naturalness

The primary consideration for TTS is the naturalness of the generated speech. Consider whether the voices sound robotic or human-like, and how well they maintain consistency across longer passages. Special attention should be paid to the model's ability to maintain consistent sound quality when processing incomplete or partial text fragments.

The model should also include robust dictation mechanisms for accurately pronouncing specific formats like phone numbers, email addresses, and confirmation codes, ensuring clear and precise communication of critical information.

There is no common quality metric for Text-to-Speech services comparable to WER for STT models. However, AI model research platforms may have custom metrics, and you can start evaluating models from there.

For example, the Artificial Analysis platform collects user responses about the quality of popular TTS models and calculates an ELO score from them. At the beginning of 2025, the best-performing TTS models have ELO scores around 1000–1100, while the rest of the models on the platform's leaderboard score around 850.

Voice customization options

Evaluate the available voice options and customization capabilities. Some businesses need multiple voices for different purposes, while others require brand-specific voice creation. 

Consider the model's ability to adjust speaking rate, pitch, and emphasis. For example, higher pitches often convey friendliness and approachability, while lower pitches project authority and seriousness.

Advanced systems may also offer control over specific voice characteristics:

  • Assertiveness: Controls the firmness of voice delivery.
  • Confidence: Affects how assured the voice sounds.
  • Smoothness: Adjusts between smooth and staccato delivery.
  • Relaxedness: Modifies tension in the voice.

Fine-grained control over voice generation, allowing developers to adjust intonation and pronunciation for specific words or phrases, is enabled by the Speech Synthesis Markup Language (SSML).
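As an illustration, a minimal SSML wrapper might look like the sketch below. The `<prosody>` and `<say-as>` elements are standard SSML 1.1, but actual support and accepted attribute values vary by provider, so treat this as a shape rather than a guaranteed API:

```python
def ssml_prosody(text: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Wrap text in an SSML <prosody> element to adjust delivery."""
    return f'<speak><prosody rate="{rate}" pitch="{pitch}">{text}</prosody></speak>'

# Spell out a confirmation code character by character using <say-as>
code = '<say-as interpret-as="characters">A1B2</say-as>'
print(ssml_prosody(f"Your confirmation code is {code}", rate="slow"))
```

The resulting string is what you would pass to a provider's synthesis endpoint in place of plain text.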

Language support

Support for multiple languages and regional variations ensures voice agents can serve diverse audiences. This includes:

  • Language selection
  • Regional accent configuration
  • Dialect-specific adjustments

Some models (e.g. cartesia-english) are configured for a single language, and if you need to switch to another language during a call, problems arise. These problems can be hard to solve, because there is no real-time update of the call configuration.


Common criteria for both technologies

Both technologies need evaluation of their pricing models, including per-usage costs or subscription fees, scaling costs with increased usage, and additional fees for premium features or customization.
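As a back-of-the-envelope illustration of how per-usage pricing scales, the sketch below turns a per-minute rate into a monthly estimate. The provider names and traffic figures are hypothetical:

```python
def monthly_stt_cost(price_per_minute: float, calls_per_day: int,
                     avg_call_minutes: float, days: int = 30) -> float:
    """Rough monthly STT spend from per-minute pricing and call volume."""
    return price_per_minute * calls_per_day * avg_call_minutes * days

# Hypothetical load: 200 calls a day averaging 4 minutes each
for name, per_minute in [("Provider A", 0.0077), ("Provider B", 0.015)]:
    print(f"{name}: ${monthly_stt_cost(per_minute, 200, 4):,.2f}/month")
```

Even a fraction-of-a-cent difference per minute compounds quickly at scale, which is why pricing deserves the same scrutiny as accuracy.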

The integration and technical requirements are also essential. Check the following points:

  • API documentation quality and ease of use
  • Development resources required
  • Compatibility with existing systems
  • Deployment options (cloud, on-premise, hybrid)
  • Service stability and guaranteed uptime
  • Update frequency and maintenance schedule

In addition, both technologies must meet your security standards, such as data handling and privacy practices, regulatory compliance capabilities, encryption standards, and access control features. Also, consider the quality and responsiveness of technical support.

A common pitfall when selecting STT/TTS models is prioritizing agent response speed or cost over consistent quality across different environments, without proper justification. In practice, conversations sometimes require giving users more time to think, which means the agent's response rate actually needs to be slowed down.

Misplaced priorities in testing approaches can lead to situations where a voice agent performs brilliantly in controlled environments like homes or offices but struggles in noisy locations or with poor audio quality.


Best Speech-to-Text (Automatic Speech Recognition) models & providers in 2025

This list covers the providers and models most commonly used for ASR tasks. To keep the article from becoming endless, it is limited to five STT models.

The main indicator by which the models below are compared is WER (Word Error Rate). As for speed metrics, not all providers publish those values. Treat WER with some caution too: it varies depending on the dataset, language, domain, and so on. Moreover, WER is usually measured on pre-recorded audio rather than a live stream, so when the models are used in voice agents, quality will usually be worse.

#1 Deepgram Nova-3

Nova-3 is Deepgram’s latest model, released in February 2025 as an upgraded version of its proprietary STT model Nova-2.

The model achieves one of the best WERs on the market: 6.84% (averaged across all domains). This figure applies to streaming audio; for batch (pre-recorded) audio it is even lower: 5.26%.

Nova-3 supports 36 languages and dialects and can switch between recognizing 10 different languages in real time. This means that if an English-speaking speaker throws in a couple of Spanish words, the model will interpret them correctly.

Deepgram’s STT model costs $0.0077 per minute for streaming audio.


#2 OpenAI Whisper V3

The STT model from OpenAI, Whisper, was first introduced in September 2022 and has since been updated twice, in December 2022 (V2) and November 2023 (V3). The model is available in five variants: from tiny to large.

Whisper was designed to be as versatile as possible, working with 99 languages worldwide. But because of this, it is difficult to evaluate its effectiveness in each specific case. For instance, the WER for English can be as low as 5–6%, while for Scandinavian languages, it ranges from 8–10%. The average WER reported for Whisper is approximately 10.6%.

The model is effective in real-world scenarios, particularly in environments with background noise or heavy accents. However, there can be limitations regarding speed and operational complexity, especially when dealing with large volumes of audio data. 

OpenAI offers Whisper as an open-source model, meaning it is free to use. For those using the Whisper API, pricing varies based on usage and implementation, starting from $0.006 per minute.


#3 AssemblyAI Universal-2

AssemblyAI has one of the most recent updates on the STT market: Universal-2 came out in November 2024 as an improved version of the Universal-1 model. It comes in two options: Best and Nano. Nano is a lightweight option that supports over 102 languages, while Best works with 20.

As of early 2025, the model from AssemblyAI can be considered one of the most accurate on the market, with an average WER of 6.68%. Reviewers note that Universal-2 copes especially well with medical and sales domains. Universal-2 employs an all-neural architecture for text formatting, significantly improving the readability of transcripts with context-aware punctuation and casing.

AssemblyAI offers a pay-as-you-go pricing model starting at $0.00025 per second, which is approximately $0.015 per minute for transcription services. Additional costs may apply for advanced features like speaker detection or sentiment analysis.


#4 Speechmatics Ursa 2

The latest and most advanced STT model from Speechmatics is called Ursa 2. It was released in October 2024. The update added new languages (e.g. Arabic dialects), expanding the list to 50. The new version also improved the accuracy and speed of the model. In certain languages, such as Spanish and Polish, Ursa 2 is the market leader with 3.3% and 4.4% WER respectively.

Many users also highlight Ursa's superior handling of diverse accents and noisy environments. That said, the average WER for Ursa 2 is 8.6%.

Speechmatics operates on a subscription-based pricing model. The minimum price for Speechmatics' Speech-to-Text services is approximately $0.0133 per minute for the Standard model in batch transcription, with higher rates for enhanced accuracy and real-time options. Real-time transcription starts at $1.04 per hour, which equals approximately $0.0173 per minute.


#5 Google Speech-to-Text Chirp

The Google Speech-to-Text service is part of the vast Google Cloud infrastructure. Speech recognition in it is handled by USM (Universal Speech Model), which is not a single model but a whole family. The most advanced model in this family is Chirp, which covers more than 125 languages; USM as a whole works with 309.

Being part of Google brings regular updates, the ability to train models on large amounts of data, and tight integration with other Google Cloud applications. Google Speech-to-Text also has a good WER, but as with Whisper, it is highly language-dependent. The average WER is 8.5%.

The first 60 minutes per month are free. After that, pricing ranges from approximately $0.024 to $0.036 per minute, depending on the features used (e.g., standard vs. enhanced models).

Comparison of top STTs in 2025:

Provider and Model          | WER*  | Languages | Cost for streaming audio (per minute) | Latency (real-time)
Deepgram Nova-3             | 6.84% | 36        | $0.0077                               | <300 ms
AssemblyAI Universal-2      | 6.68% | 102       | $0.015                                | <1 s
OpenAI Whisper              | 10.6% | 99        | $0.006                                | N/A
Speechmatics Ursa 2         | 8.6%  | 50        | $0.0173                               | <1 s
Google Speech-to-Text Chirp | 8.5%  | 309       | $0.024                                | N/A

*Average WER values officially declared by the providers. Values may vary depending on the dataset on which the model is tested.


Best Text-to-Speech models & providers in 2025

Similar to the previous selection, this is a limited list. It includes those providers that show the best quality according to the Artificial Analysis Leaderboard in February 2025.

All the services listed below have good voice realism according to user reviews, but we recommend listening to their samples yourself. Emotionality and realism of voice are among the most important criteria for evaluating TTS: many people are sensitive to the uncanny valley effect, and chances are that the more realistic the voice, the better its effect on the overall conversion rate or effectiveness of the voice agent.

#1 ElevenLabs Flash

As of early 2025, ElevenLabs offers two TTS models: Multilingual and Flash. Multilingual is optimized for maximum realism and humanness of the voice, while Flash is an ultra-fast model with ~75 ms latency. For voice agents specifically, Flash is recommended. It is integrated into ElevenLabs' broader platform for building customizable interactive voice agents.

Flash from ElevenLabs works with 32 languages. Users can customize various aspects of the voice output, including tone and emotional expression, and can also clone voices. Business subscriptions are monthly or yearly and start at $275 per month (or $103 per 1M characters).

💡
This and some other providers on the list do not disclose the price per 1M characters. In such cases, we will indicate the figure based on calculations from the Artificial Analysis platform.

The ElevenLabs Playground is available here, but you need to sign up to try it out. A basic version of the playground is also available on the provider's homepage, but with a limited selection of voices, no customization, and no option to choose a model.

There is also a Voice Library, where you can listen to all voices categorized by use case. Interestingly, not only the company's developers but also community members can upload voices to it.

Listen to how two popular voices (male and female) from the ElevenLabs library sound:

[Audio sample: ElevenLabs Charlie]
[Audio sample: ElevenLabs Jessica]

Here and below, the phrase "Hi! I’m a Text-to-Speech model, and this is what my voice sounds like." is used for testing. No additional voice settings have been applied.


#2 Cartesia Sonic

Cartesia offers only one TTS model, Sonic. It is also quite fast, showing 90 ms latency, which is very good for real-time conversations. Developers can customize voice attributes such as pitch, speed, and emotion, allowing for tailored speech outputs that meet specific needs. Cartesia's technology also allows models to run directly on devices.

Sonic supports instant voice cloning with minimal audio input (as little as 10 seconds), enabling users to replicate specific voices accurately. Sonic works with 15 languages. Business subscriptions start at $49 per month, or $46.7 per 1M characters.

The Cartesia Playground is available at this link, but it is only accessible to logged-in users. In the Voices section, you can browse the entire voice library.

Here’s how two voices from the Cartesia library sound: Help Desk Woman and Customer Service Man:

[Audio sample: Cartesia Help Desk Woman]
[Audio sample: Cartesia Customer Service Man]

#3 Google Text-to-Speech Studio

The Google TTS service has several models. Among them, the Studio model has the highest rating on the Artificial Analysis platform. Google does not provide specific latency figures for the model, but according to user feedback, it is around 500 ms.

This TTS supports over 380 voices across 50+ languages and variants. Users can create unique voices by recording samples, allowing brands to have a distinct voice across their customer interactions. The API supports SSML, enabling developers to control aspects like pitch, speed, volume, and pronunciation for more tailored speech outputs.

Google TTS offers a flexible pay-as-you-go pricing model. For the Studio model, charges start at $160 per 1M characters.

You can try out Google's TTS service here. To do so, you need to be logged into your Google account. On the playground, you can choose between only two voices—male or female—but you can customize how they sound on different devices, such as a car speaker.

[Audio sample: Google female]
[Audio sample: Google male]

#4 Amazon Polly Generative

Amazon Polly is a TTS service from cloud provider Amazon that seamlessly integrates with other AWS services. Polly has four models: Generative, Long-Form, Neural, and Standard. The first one is considered the most advanced.

The service supports 34 languages as well as their dialects. One to 10+ voice options are offered for each language. A total of 96 voices (for all languages) are available to users.

Polly supports SSML, allowing developers to fine-tune speech output with specific instructions regarding pronunciation, volume, pitch, and speed. Users can create custom voices tailored to specific branding needs or preferences. The delay of Amazon Polly responses varies from 100ms to 1 second. The service operates on a pay-as-you-go pricing model. The minimal cost for business users of the Generative model is $30 per 1M characters.

You can try Amazon Polly through the AWS Console. To do so, you’ll need to sign up and log in to your account, including entering your credit card details—even if you don’t plan to use Amazon’s paid services. Note that the Generative model may not be available in some regions.

Here’s how it sounds:

[Audio sample: Amazon Ruth]
[Audio sample: Amazon Patrick]

#5 Microsoft Azure AI Speech Neural

Microsoft Azure TTS is a part of the Microsoft Azure ecosystem that integrates seamlessly with other Azure AI services. Microsoft's TTS model is called Neural, and it has several versions: Standard, Custom, and HD. The latter is the most advanced.

It supports over 140 languages and locales. Users can create custom neural voices tailored to their brand or application needs. Developers can use SSML to customize pronunciation, intonation, and other speech characteristics.

The service generally exhibits latency under 300 ms. Microsoft Azure TTS pricing starts at $15 per 1M characters. The HD version comes at a higher cost, but the provider does not specify an exact price; it is negotiated individually.

Microsoft's Voice Library is open and accessible even to guest users via this link. However, to try out the TTS functionality, you’ll need to sign in with an Azure account.

💡
The Azure AI Speech demo version has a limit of 12 audio play requests per minute.

Examples of Featured voices are below:

[Audio sample: Microsoft Andrew]
[Audio sample: Microsoft Ava]

#6 PlayHT Dialog

PlayHT provides two TTS models: PlayHT 3.0 and Dialog. Both can be used for AI voice agents, but PlayHT Dialog is designed specifically for conversational applications. It was released in February 2025.

Dialog works with 9 main languages and 23 additional ones. The model's latency is 300 ms, and its main advantage is highly natural, expressive, and fluid voices. More than 50 voices are available, along with voice-cloning functionality.

The cost of PlayHT TTS starts at $39 per month, but this plan is limited to 250,000 characters. The unlimited plan costs $99 per month; in terms of price per 1M characters, it works out to $150.

PlayHT has an open playground where even unregistered users can explore its features. However, you’ll need to log in to generate speech.

Here’s how the voices that the platform suggests trying first sound:

[Audio sample: PlayHT Angelo]
[Audio sample: PlayHT Jennifer]

Comparison of top TTSs in 2025:

Provider and Model             | ELO Score* | Languages | Cost (per 1M characters) | Latency
ElevenLabs Flash               | 1145.4     | 32        | $103                     | 75 ms
Cartesia Sonic                 | 1137.5     | 15        | $46.7                    | 90 ms
Google Text-to-Speech Studio   | 1046.5     | 50+       | $160                     | 500 ms
Amazon Polly Generative        | 1063.9     | 34        | $30                      | 100 ms
Microsoft Azure Text to Speech | 1054.5     | 140+      | $15                      | 300 ms
PlayHT Dialog                  | 991        | 32        | $150                     | 303 ms

*ELO Score as specified on the Artificial Analysis platform in February 2025


Choosing the right STT and TTS models for your project

In general, a pattern holds: Google, Microsoft, and other large providers offer more stability, while smaller companies like ElevenLabs or Deepgram can offer more realistic TTS voices and better STT speeds.

Taras, CTO at Softcery, also shares his recommendations when choosing a model, depending on the use case:

  • For projects within entertainment or interactivity/gaming, emotionality and realism in voice (TTS) are important.
  • For commercial agents with expected high call volume, stability is important.
  • For agents for appointments at a medical center, hair salon, golf club, etc., low WER in STT is important, as contact information is often dictated there, and it is critical to transcribe it correctly.

Testing your chosen models in real-world conditions is crucial for success. While providers often showcase performance in ideal settings, voice agents frequently encounter challenging scenarios. Users may speak quietly or have speech impairments, heavy accents can affect recognition accuracy, and background noise or poor connections can significantly impact performance. Multiple speakers talking simultaneously or interrupting each other also present unique challenges that may not be apparent during initial testing.

When planning for long-term success, consider how your chosen solution will scale alongside your project. This means evaluating providers' capabilities to handle sudden traffic spikes and support multi-region deployment without compromising performance. Additionally, assess their roadmap for adding new languages and voice options, as well as their ability to integrate with an expanding tech stack.