Amazon Polly
Amazon Polly is another service from the AWS suite. It uses deep learning to convert text to speech (TTS), turning various types of written content into lifelike synthetic speech.
Organizations can use it to read articles on a website, narrate audiobooks, or create a customer service assistant.
The Amazon Polly service offers four voice engines that differ in sophistication. They range from the basic standard engine to the more advanced neural, long-form, and generative engines.
What is Amazon Polly?
As mentioned in the introduction, Amazon Polly is an advanced speech synthesizer. It's a cloud service that converts written text into speech that sounds as natural as possible. Amazon Polly allows organizations to increase engagement and accessibility of the offered content.
The service provides access to various voices (male and female), and their intonation can be customized. This allows companies to adapt the speech synthesizer to business needs. Additionally, users have access to 4 different voice engines.
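To make this concrete, here is a minimal sketch of a text-to-speech request using the AWS SDK for Python (boto3). The helper names build_speech_request and synthesize_to_file are ours, and the default voice, engine, and output format are assumptions chosen for illustration. The second function needs boto3 and valid AWS credentials to run.

```python
from typing import Any, Dict

# Engine identifiers accepted by the Polly API.
VALID_ENGINES = {"standard", "neural", "long-form", "generative"}


def build_speech_request(
    text: str,
    voice_id: str = "Joanna",      # assumed default voice for this sketch
    engine: str = "neural",        # assumed default engine for this sketch
    output_format: str = "mp3",
) -> Dict[str, Any]:
    """Build the keyword arguments for a SynthesizeSpeech call."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"Unknown engine: {engine!r}")
    return {
        "Text": text,
        "VoiceId": voice_id,
        "Engine": engine,
        "OutputFormat": output_format,
    }


def synthesize_to_file(text: str, path: str, **options: Any) -> None:
    """Send the request to Amazon Polly and save the returned audio."""
    import boto3  # imported lazily so the helper above works without the SDK

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_speech_request(text, **options))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

Calling synthesize_to_file("Hello from Amazon Polly", "hello.mp3") would write an MP3 file, provided the chosen voice supports the chosen engine.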
Voice engines offered by Amazon Polly
Amazon Polly offers four voice engines. They differ in how natural the synthesized voice sounds and in the number of languages they support.
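Because voice and language coverage differs per engine, it can help to query it rather than hard-code it. The sketch below assumes the DescribeVoices operation in boto3 and its "SupportedEngines" field. The function names are ours, and the sample data in the usage note is made up for illustration.

```python
from typing import Any, Dict, List


def voices_for_engine(voices: List[Dict[str, Any]], engine: str) -> List[str]:
    """Return the sorted IDs of the voices that list the given engine."""
    return sorted(
        voice["Id"]
        for voice in voices
        if engine in voice.get("SupportedEngines", [])
    )


def list_voices(language_code: str) -> List[Dict[str, Any]]:
    """Fetch all voices for a language, following pagination tokens."""
    import boto3  # imported lazily; requires the SDK and AWS credentials

    polly = boto3.client("polly")
    kwargs: Dict[str, Any] = {"LanguageCode": language_code}
    voices: List[Dict[str, Any]] = []
    while True:
        response = polly.describe_voices(**kwargs)
        voices.extend(response["Voices"])
        token = response.get("NextToken")
        if not token:
            return voices
        kwargs["NextToken"] = token
```

For example, given the made-up list [{"Id": "VoiceA", "SupportedEngines": ["standard", "neural"]}, {"Id": "VoiceB", "SupportedEngines": ["neural"]}], calling voices_for_engine with "standard" returns only "VoiceA".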
Standard Engine
The standard engine is Amazon Polly's most basic voice engine. It also supports the largest number of languages. With it, you can use 40 languages and their variants (British English, American English, Australian English, etc.).
The standard engine uses concatenative synthesis: it strings together the phonemes of recorded speech to produce natural-sounding output. You can choose from 40 female voices and 20 male voices. For example, for American English, you can select the voices of Ivy or Joey.
Neural Engine
The neural engine is more advanced than the standard engine and uses more sophisticated technology to produce speech. To better imitate the complexity of human speech, the neural engine uses a neural network and a vocoder. The neural network converts a sequence of phonemes into a sequence of spectrograms so that the vocoder can transform it into an audio signal. As a result, Amazon Polly can produce high-quality speech, which is more similar to natural human speech.
The neural engine supports 35 languages. For American English, you can choose from 13 different voices (male and female).
Long-form Engine
The long-form engine is designed to synthesize speech that holds listeners' attention over longer content (articles on blogs, audiobooks, etc.). With the help of this engine, organizations can create a voice that is more human-like, expressive, and sensitive to emotions.
The engine can combine phonetic and acoustic elements of human language to produce natural-sounding speech. Using text embedding, users can instruct the engine and give it directions for intonating and emphasizing specific words. They can also indicate where the engine should make appropriate pauses.
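One way to give Amazon Polly explicit directions about pauses is SSML (Speech Synthesis Markup Language). The sketch below builds an SSML document that inserts a pause between sentences with the break tag. The function name and the default pause length are our assumptions. Which SSML tags are supported varies by engine and voice, so check the tag against the engine you plan to use.

```python
from typing import Iterable
from xml.sax.saxutils import escape


def ssml_with_pauses(sentences: Iterable[str], pause_ms: int = 500) -> str:
    """Join sentences into one SSML document with a pause between them."""
    pause = f'<break time="{pause_ms}ms"/>'
    # escape() protects characters such as & and < that would break the XML.
    body = pause.join(escape(sentence) for sentence in sentences)
    return f"<speak>{body}</speak>"
```

The resulting string is passed as the Text parameter of a synthesis request, with TextType set to "ssml".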
This voice engine is currently available only for English and Spanish. Each language has a couple of voice variants.
Generative Engine
The generative engine is the most advanced voice engine Amazon Polly offers. The speech generated by this engine comes closest to human speech, mimicking its emotional range and adaptability.
The engine is built on a transformer model with a billion parameters, which allows it to learn from larger amounts of voice data. That makes it a good fit for use cases where a highly natural-sounding voice is essential (e.g., customer service).
The generative engine supports English, French, Spanish, German, and Italian. Each language has its own voice variants.
Amazon Polly and Amazon Lex — a duo that creates a comprehensive conversational interface
Amazon Polly works well with other AWS services, including Amazon Lex. Amazon Lex processes text and automatically recognizes speech, making it possible to integrate a conversational interface into an application. Thanks to Amazon Lex's ability to understand natural language, it can carry a conversation in a human-like manner. It recognizes the user's intentions and accurately adapts its answers.
Amazon Lex uses Amazon Polly to synthesize speech, so these services are almost inseparable. This allows companies to create applications that communicate with users in multiple languages and speech styles.
Amazon Polly's voices aren't built from scratch; they are based on recorded speech of real people who are native speakers of their respective languages. As a result, the voices available within the service sound very similar to human voices, and customers can feel like they're talking to or listening to a real human being.
Advantages of Amazon Polly
Amazon Polly also has the following advantages:
High-quality voice synthesis
Amazon Polly offers high-quality text-to-speech conversion. With the option to choose various voice engines, organizations can adapt the capabilities of their speech synthesizer to business needs. Moreover, Amazon Polly's engines are characterized by a clear pronunciation of plain texts as well as acronyms, numbers, dates, abbreviations, etc.
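When the default reading of an acronym or a date is not the one you want, SSML offers the say-as tag to state the intended interpretation. Below is a small sketch. The helper name say_as and the sample sentence are ours, and support for individual interpret-as values may differ between engines.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, date_format: Optional[str] = None) -> str:
    """Wrap text in an SSML say-as tag with the given interpretation."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if date_format:
        attrs += f" format={quoteattr(date_format)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


# Spell out an acronym letter by letter and read a date as month-day-year.
EXAMPLE_SSML = (
    "<speak>Your "
    + say_as("AWS", "characters")
    + " invoice is due on "
    + say_as("12/31/2025", "date", "mdy")
    + ".</speak>"
)
```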
Choice of many languages and voices
Amazon Polly supports 40 languages in total, including their variants (e.g., American/British English). Depending on the chosen voice engine, users have access to a different range of languages. English and Spanish work with all available engines. Each language also has a couple of female and male voices to choose from.
Moreover, Amazon Polly allows you to choose the style of speech. For example, for the English language, you can select a "newscaster" voice that will read the content in a newscaster fashion.
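The newscaster style is requested through SSML, using the amazon:domain tag with the name "news". The sketch below shows what such a request could look like. The function names are ours, and the choice of the Matthew voice with the neural engine is an assumption for this example; the style is only available for selected voices.

```python
from typing import Any, Dict
from xml.sax.saxutils import escape


def newscaster_ssml(text: str) -> str:
    """Wrap plain text in the SSML tag that selects the newscaster style."""
    return f'<speak><amazon:domain name="news">{escape(text)}</amazon:domain></speak>'


def newscaster_request(text: str, voice_id: str = "Matthew") -> Dict[str, Any]:
    """Build SynthesizeSpeech arguments for a newscaster-style reading."""
    return {
        "Text": newscaster_ssml(text),
        "TextType": "ssml",        # tells Polly the text is SSML, not plain text
        "VoiceId": voice_id,       # assumed voice; must support the style
        "Engine": "neural",        # assumed engine for this style
        "OutputFormat": "mp3",
    }
```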
Integration with the AWS cloud
Converting text to speech requires a lot of computing resources, especially for big projects. With Amazon Polly, that processing runs in the AWS cloud, so you don't need to provide the infrastructure yourself. Running in AWS also lets Amazon Polly integrate with other Amazon services such as Amazon Lex, Amazon Connect, or Amazon S3.
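As an illustration of the Amazon S3 integration, the sketch below starts an asynchronous synthesis task that writes its audio file to an S3 bucket. It assumes the StartSpeechSynthesisTask operation in boto3. The function names, the bucket name in the usage note, and the default key prefix are hypothetical.

```python
from typing import Any, Dict


def build_s3_task_request(
    text: str,
    bucket: str,
    key_prefix: str = "audio/",    # hypothetical prefix for the output object
    voice_id: str = "Joanna",      # assumed voice for this sketch
    engine: str = "neural",        # assumed engine for this sketch
) -> Dict[str, Any]:
    """Build the arguments for an asynchronous synthesis task."""
    if not bucket:
        raise ValueError("An output S3 bucket name is required")
    return {
        "Text": text,
        "VoiceId": voice_id,
        "Engine": engine,
        "OutputFormat": "mp3",
        "OutputS3BucketName": bucket,
        "OutputS3KeyPrefix": key_prefix,
    }


def start_task(text: str, bucket: str) -> str:
    """Start the task and return its ID so the status can be polled later."""
    import boto3  # imported lazily; requires the SDK and AWS credentials

    polly = boto3.client("polly")
    response = polly.start_speech_synthesis_task(**build_s3_task_request(text, bucket))
    return response["SynthesisTask"]["TaskId"]
```

A call such as start_task(article_text, "my-example-bucket") returns immediately; the audio appears in the bucket once the task completes.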
Low latency
Amazon Polly responds quickly enough for real-time use, which makes it a good fit for use cases involving conversations with users.
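In a conversational application, you can reduce the wait further by handling the returned audio in chunks instead of reading the whole response first. The generator below is a sketch of that idea. It assumes the audio stream behaves like a file-like object with a read(size) method, and the chunk size is an arbitrary choice.

```python
from typing import BinaryIO, Iterator


def iter_audio_chunks(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield audio data piece by piece until the stream is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk
```

With boto3, you would pass response["AudioStream"] from a synthesize_speech call as the stream and forward each chunk to the player or the network connection as it arrives.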
What are the use cases of Amazon Polly?
You can use Amazon Polly wherever the conversion of text to speech will increase customer engagement and improve the accessibility of products. Below, we describe a few examples.
Creating content
Amazon Polly can make content more accessible to people with disabilities and to those who prefer listening to longer texts instead of reading them. You can use the service to read blog articles and create audiobooks. For example, the Washington Post and USA Today use Amazon Polly.
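Long articles usually have to be split before synthesis, because a single synchronous request accepts only a limited amount of text. The sketch below splits text at sentence boundaries. The function name is ours, and the default of 3000 characters is an assumption; check the current service quota before relying on it.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 3000) -> List[str]:
    """Group whole sentences into chunks no longer than max_chars."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars is kept whole here
            # and would need further splitting before it is sent.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request, and the resulting audio files played or joined in order.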
Newscasts
You can use Amazon Polly to read text like a professional newscaster if you're writing news articles in American/British English or Spanish. This will increase audience engagement and allow you to convert text into an engaging video with narration.
Customer service
Amazon Polly, combined with Amazon Connect and Amazon Lex, can create an advanced customer service assistant. Using more natural voices in the customer service industry reduces customer frustration and allows you to solve their problems faster.
Education
The Amazon Polly speech synthesizer is also suitable for educational use. It allows you to create tutorials or other educational videos quickly. A good example is the Duolingo application, which uses Amazon Polly to convert text to speech. The service also enables you to create voice-overs for animations, which is useful for educational content aimed at children.
Summary
Amazon Polly is an AWS service that converts text to speech. It can create natural-sounding synthetic speech that doesn't sound like a robot. Thanks to deep learning, Amazon Polly produces high-quality speech that mimics how people speak.
Amazon Polly offers various voice engines for many different use cases. Depending on business needs, organizations can create virtual customer service assistants, applications, and educational videos and ensure access to content for a broad audience.