Overview
AWS Polly (officially Amazon Polly) is a text-to-speech (TTS) cloud service that converts text into spoken audio. It provides voices across many languages in more than one engine: standard voices, which use concatenative synthesis, and Neural Text-to-Speech (NTTS) voices, which are designed to produce more natural, human-like intonation and pronunciation [docs.aws.amazon.com]. Applications use it to deliver spoken feedback, create audio versions of written content, and support accessibility features for users with visual impairments or reading difficulties.
Polly operates on a pay-as-you-go model, where costs are determined by the number of characters processed. This pricing structure makes it suitable for both small-scale projects and large-volume content generation. The service is integrated with the broader AWS ecosystem, allowing it to be combined with other AWS services like Amazon S3 for storage of generated audio files, or AWS Lambda for event-driven audio synthesis.
Developers primarily use AWS Polly in applications that need voice output, such as voice assistants, interactive voice response (IVR) systems, and navigation apps. It is also used to generate voiceovers for videos, e-learning modules, and podcasts, automating the narration step of content creation. For content creators, generating long-form audio with Polly can reduce the time and cost associated with professional voice acting. The service offers synchronous synthesis, which returns an audio stream directly for real-time use, and asynchronous synthesis for longer text inputs, which writes its output to Amazon S3 [docs.aws.amazon.com]. Its HIPAA eligibility and GDPR readiness make it an option for regulated industries that handle sensitive information.
Key features
- Standard Voices: Offers a range of male and female voices across multiple languages at the lowest per-character price. Standard voices use concatenative synthesis rather than the neural engine, so they tend to sound less natural than the neural voices.
- Neural Voices (NTTS): Provides a higher quality of speech synthesis, designed to produce more lifelike and human-sounding audio, particularly suitable for customer-facing applications and media.
- Long-form Voices (LF-NTTS): Optimized for synthesizing longer pieces of content, such as audiobooks and articles, maintaining consistent voice quality and intonation over extended passages.
- Speech Marks: Provides metadata about the timing of spoken words, sentences, and phonemes. This feature is useful for synchronizing speech with visual elements, such as animating characters or highlighting text during playback [docs.aws.amazon.com].
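- SSML Support: Supports Speech Synthesis Markup Language (SSML), allowing developers to control aspects of speech such as pronunciation, volume, pitch, and speaking rate [docs.aws.amazon.com]. Tag support varies by engine; some tags available for standard voices are not supported by neural voices.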
- Custom Lexicons: Enables users to define custom pronunciations for specific words, acronyms, or foreign terms, ensuring consistent and accurate speech output.
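- Asynchronous Synthesis: For text too long for a single synchronous request, Polly supports asynchronous synthesis tasks (StartSpeechSynthesisTask in the API). The caller submits the text, Polly writes the audio to an Amazon S3 bucket when processing completes, and the task status can be polled in the meantime, which suits batch jobs.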
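The SSML and Speech Marks features above can be sketched together in Python. This is a minimal illustration, not a reference implementation: the helper names (`build_ssml`, `speech_mark_request`, `parse_speech_marks`, `fetch_speech_marks`) are hypothetical, and the sketch assumes the speech marks arrive as one JSON object per line in the response's `AudioStream`, requested with `OutputFormat="json"`.

```python
import json
from xml.sax.saxutils import escape


def build_ssml(text: str, rate: str = "medium") -> str:
    """Wrap plain text in SSML with a prosody rate, escaping XML special characters."""
    return f'<speak><prosody rate="{rate}">{escape(text)}</prosody></speak>'


def speech_mark_request(ssml: str, voice_id: str = "Joanna") -> dict:
    """Build keyword arguments for synthesize_speech that request speech marks."""
    return {
        "Text": ssml,
        "TextType": "ssml",              # tell Polly the input is SSML, not plain text
        "OutputFormat": "json",          # speech marks come back as JSON lines, not audio
        "SpeechMarkTypes": ["word", "sentence"],
        "VoiceId": voice_id,
    }


def parse_speech_marks(raw: bytes) -> list:
    """Parse a line-delimited JSON payload into a list of dicts."""
    lines = raw.decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def fetch_speech_marks(polly_client, text: str) -> list:
    """Request word and sentence marks for `text` using a Polly client passed in by the caller."""
    response = polly_client.synthesize_speech(**speech_mark_request(build_ssml(text)))
    return parse_speech_marks(response["AudioStream"].read())
```

With real credentials, `fetch_speech_marks(boto3.client("polly"), "Hello world")` would return one dict per mark; the `time` field (milliseconds from the start of the audio) is what a caller would typically use to drive text highlighting. Passing the client in as a parameter keeps the parsing logic testable without network access.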
Pricing
AWS Polly offers a pay-as-you-go pricing model based on the number of characters processed. There is a free tier available for new AWS customers.
| Tier | Voice type | Price or allowance | As of date |
|---|---|---|---|
| Free tier (new customers) | Standard | 5 million characters/month free for 12 months | 2026-05-28 |
| Free tier (new customers) | Neural | 1 million characters/month free for 12 months | 2026-05-28 |
| On-demand | Standard | $4.00 per 1 million characters | 2026-05-28 |
| On-demand | Neural | $16.00 per 1 million characters | 2026-05-28 |
Detailed and up-to-date pricing information is available on the AWS Polly pricing page.
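To make the per-character arithmetic concrete, the following sketch estimates a monthly bill from a character count. The rates and free-tier allowances are copied from the table above and are assumptions that may be out of date; the function name `estimate_monthly_cost` is hypothetical, and the calculation ignores any engines or discounts not listed in the table.

```python
# USD per 1 million characters, taken from the pricing table (verify before relying on them).
RATES_PER_MILLION = {"standard": 4.00, "neural": 16.00}

# Monthly free-tier allowances for new customers, also from the table.
FREE_TIER_CHARS = {"standard": 5_000_000, "neural": 1_000_000}


def estimate_monthly_cost(characters: int, engine: str = "standard",
                          free_tier: bool = False) -> float:
    """Estimate the monthly cost in USD for a given number of synthesized characters."""
    if engine not in RATES_PER_MILLION:
        raise ValueError(f"unknown engine: {engine!r}")
    if characters < 0:
        raise ValueError("characters must be non-negative")
    billable = characters
    if free_tier:
        # Only characters beyond the monthly allowance are billed.
        billable = max(0, characters - FREE_TIER_CHARS[engine])
    return round(billable / 1_000_000 * RATES_PER_MILLION[engine], 2)
```

For example, `estimate_monthly_cost(2_500_000, "neural")` gives 40.0, and `estimate_monthly_cost(6_000_000, "standard", free_tier=True)` gives 4.0, since only the 1 million characters above the allowance are billed.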
Common integrations
- AWS SDK for Python (Boto3): Allows Python developers to interact with Polly for text-to-speech synthesis within their applications [docs.aws.amazon.com].
- AWS SDK for Java: Provides Java APIs for integrating Polly's speech capabilities into Java applications [docs.aws.amazon.com].
- AWS SDK for JavaScript: Enables client-side and Node.js applications to use Polly for generating speech [docs.aws.amazon.com].
- AWS SDK for .NET: Facilitates the integration of Polly into .NET applications and services [docs.aws.amazon.com].
- AWS Command Line Interface (CLI): Allows for direct interaction with Polly via command-line commands for scripting and automation of text-to-speech tasks [docs.aws.amazon.com].
- Amazon S3: Asynchronous synthesis tasks write their audio output directly to an Amazon S3 bucket; audio from synchronous requests can be uploaded to S3 by the application for persistent storage and distribution.
- AWS Lambda: Serverless functions in AWS Lambda can call Polly to process text and generate speech in response to events.
- Amazon Translate: Can be used in conjunction with Amazon Translate to first translate text into a target language, and then synthesize it into speech using Polly.
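The Amazon Translate pairing described above can be sketched as a two-step pipeline. This is an illustrative outline under stated assumptions: the function name `translate_then_speak` is hypothetical, both clients are passed in by the caller (for example `boto3.client("translate")` and `boto3.client("polly")`), and the chosen voice is assumed to match the target language.

```python
def translate_then_speak(text: str, target_language: str, voice_id: str,
                         translate_client, polly_client) -> bytes:
    """Translate `text`, then synthesize the translation and return MP3 bytes."""
    # Step 1: translate; "auto" asks Amazon Translate to detect the source language.
    translated = translate_client.translate_text(
        Text=text,
        SourceLanguageCode="auto",
        TargetLanguageCode=target_language,
    )["TranslatedText"]

    # Step 2: synthesize the translated text with a voice for the target language.
    response = polly_client.synthesize_speech(
        Text=translated,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    return response["AudioStream"].read()
```

A call such as `translate_then_speak("Hello", "es", "Lucia", translate, polly)` would return audio bytes that the caller can write to a file or upload to Amazon S3. Error handling is omitted for brevity; in practice each call can raise `botocore.exceptions.ClientError`.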
Alternatives
- Google Cloud Text-to-Speech: Offers a range of voices and languages with advanced customization, including WaveNet technology for highly natural speech.
- Microsoft Azure Text to Speech: Provides customizable neural voices, supporting many languages and offering fine-grained control over speech output.
- ElevenLabs: Focuses on realistic and expressive speech synthesis, specializing in long-form content and voice cloning.
Getting started
To get started with AWS Polly using the AWS SDK for Python (Boto3), you'll need to install the SDK and configure your AWS credentials. The following example demonstrates how to synthesize text into an MP3 audio file.
import boto3
from botocore.exceptions import ClientError

# Create a client using the default credentials
polly_client = boto3.client('polly')

text_to_synthesize = "Hello, this is AWS Polly. I can convert your text into lifelike speech."
output_filename = "hello_polly.mp3"

try:
    response = polly_client.synthesize_speech(
        Text=text_to_synthesize,
        OutputFormat='mp3',
        VoiceId='Joanna'  # Or any other available VoiceId like 'Matthew', 'Amy', etc.
    )
    # The audio stream is in the 'AudioStream' field of the response
    if "AudioStream" in response:
        with open(output_filename, 'wb') as file:
            file.write(response['AudioStream'].read())
        print(f"Audio content saved to {output_filename}")
    else:
        print("Could not find AudioStream in response.")
except ClientError as e:
    print(f"Error synthesizing speech: {e}")
This Python script initializes a Polly client, specifies the text to be converted, chooses an output format (MP3), and selects a voice (e.g., 'Joanna'). It then calls the synthesize_speech method, reads the audio stream from the response, and saves it to a local MP3 file. For more detailed examples and language-specific SDK documentation, refer to the AWS Polly documentation.
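For text too long for a single synchronous request, the asynchronous path can be sketched in the same style. This is a minimal outline, assuming the caller supplies a Polly client and the name of an S3 bucket that Polly is allowed to write to; the function name `synthesize_long_text` and the polling parameters are hypothetical choices, not part of the AWS API.

```python
import time


def synthesize_long_text(polly_client, text: str, bucket: str,
                         voice_id: str = "Joanna",
                         poll_seconds: float = 5.0,
                         max_polls: int = 60) -> str:
    """Start an asynchronous synthesis task and return the output URI once it completes."""
    task = polly_client.start_speech_synthesis_task(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
        OutputS3BucketName=bucket,
    )["SynthesisTask"]

    for _ in range(max_polls):
        status = task["TaskStatus"]
        if status == "completed":
            return task["OutputUri"]
        if status == "failed":
            raise RuntimeError(task.get("TaskStatusReason", "synthesis task failed"))
        # Still scheduled or in progress: wait, then fetch the latest task state.
        time.sleep(poll_seconds)
        task = polly_client.get_speech_synthesis_task(TaskId=task["TaskId"])["SynthesisTask"]

    raise TimeoutError("synthesis task did not finish within the polling window")
```

Calling `synthesize_long_text(boto3.client("polly"), long_text, "my-audio-bucket")` would block until the task finishes and then return the location of the generated file. Polling is the simplest approach for a script; an event-driven setup could instead use the task's optional SNS notification.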