Building a Serverless Text-to-Speech Application Using AWS Lambda, API Gateway, and Amazon Polly

Charith Herath
5 min read · Feb 27, 2025

Introduction

In this article, we will explore the architecture and workflow of a serverless text-to-speech (TTS) conversion application built on AWS. This solution leverages AWS Lambda, API Gateway, Amazon Polly, Amazon S3, DynamoDB, and Amazon SQS to provide an asynchronous and scalable text-to-speech conversion service.

Architecture Diagram

Application Workflow

1. User Input via Frontend

The front end, hosted on AWS Amplify, provides users with an interface to input text and select the preferred voice gender (male or female). When the user submits the request, it is sent to Amazon API Gateway via an HTTP POST request.
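
As a rough illustration of the request the frontend sends, the sketch below builds the same HTTP POST in Python. The field names `text` and `choice_of_voice` match what Lambda Function 1 reads; the endpoint URL and the helper name `build_tts_request` are hypothetical placeholders, not part of the original project.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the invoke URL of your own API Gateway stage.
API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/convert"


def build_tts_request(text, choice_of_voice, api_url=API_URL):
    """Build (but do not send) the HTTP POST request for a conversion."""
    payload = {"text": text, "choice_of_voice": choice_of_voice}
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_tts_request("Hello from Polly", "Joanna")
# To actually send it: urllib.request.urlopen(request, timeout=30)
```

Building the request separately from sending it keeps the payload shape easy to inspect before any network call is made.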

2. Processing Request with Lambda and DynamoDB

  • API Gateway forwards the request to Lambda Function 1.
  • Lambda Function 1 generates a session ID and stores the request details in Amazon DynamoDB.
  • The function then sends the request to an Amazon SQS queue for asynchronous processing.

import json
import boto3
import uuid
import os
import logging
import time

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Table name and queue URL are injected through Lambda environment variables.
dynamodb_table = os.getenv('DYNAMODB_TABLE')
queue_url = os.getenv('SQS_QUEUE_URL')

dynamodb = boto3.resource('dynamodb')
sqs = boto3.client('sqs')

max_retries = 5  # how many times to check DynamoDB for the download URL
wait_time = 5    # seconds to wait after each unsuccessful check

def lambda_handler(event, context):
    table = dynamodb.Table(dynamodb_table)

    # Fields are read from the top level of the event, which assumes a
    # non-proxy integration (or a mapping template) in API Gateway.
    text = event.get('text', '')
    choice_of_voice = event.get('choice_of_voice', '')
    session_id = event.get('session_id')

    # Generate a session ID when the caller did not supply one.
    if not session_id:
        session_id = str(uuid.uuid4())

    if not text or not choice_of_voice:
        return {
            'statusCode': 400,
            'body': json.dumps({'error': 'Missing required fields: text or choice_of_voice'})
        }

    # Store the request so Lambda Function 2 can attach the download URL later.
    table.put_item(
        Item={
            'session_id': session_id,
            'text': text,
            'choice_of_voice': choice_of_voice
        }
    )

    # Queue the request for asynchronous conversion.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({
            'session_id': session_id,
            'text': text,
            'choice_of_voice': choice_of_voice
        })
    )

    # Poll DynamoDB until Lambda Function 2 has written the download URL.
    for i in range(max_retries):
        db_response = table.get_item(
            Key={
                'session_id': session_id
            },
            ConsistentRead=True  # avoid reading a stale copy of the item
        )

        if 'Item' in db_response:
            item = db_response['Item']
            if item.get('download_url'):
                success_message = {
                    'message': "Thank you for your request. Your text is converted into voice",
                    'session_id': session_id,
                    'download_url': item['download_url']
                }
                logger.info(json.dumps(success_message))
                return {
                    'statusCode': 200,
                    'body': json.dumps(success_message)
                }

        logger.info(f"Retry {i+1}/{max_retries}: Download URL not available. Retrying in {wait_time} seconds")
        time.sleep(wait_time)

    # All attempts exhausted without a download URL.
    error_message = {
        'message': "Text-to-speech conversion failed or took too long. Please try again later.",
        'session_id': session_id
    }

    logger.error(json.dumps(error_message))

    return {
        'statusCode': 500,
        'body': json.dumps(error_message)
    }
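
The handler above reads `text` and `choice_of_voice` directly from the event. If the API were configured with Lambda proxy integration instead, the payload would arrive as a JSON string under `event['body']`. The following is a minimal sketch of a helper that tolerates both shapes; the name `extract_payload` is hypothetical, and base64-encoded bodies are not handled.

```python
import json


def extract_payload(event):
    """Return the request fields for proxy and non-proxy integrations."""
    body = event.get("body")
    if body is None:
        # Non-proxy integration: the fields are already at the top level.
        return event
    if isinstance(body, str):
        try:
            body = json.loads(body)
        except json.JSONDecodeError:
            # Malformed JSON is treated as an empty payload.
            return {}
    return body if isinstance(body, dict) else {}


payload = extract_payload({"body": '{"text": "Hello", "choice_of_voice": "Joanna"}'})
text = payload.get("text", "")
choice_of_voice = payload.get("choice_of_voice", "")
```

Returning an empty dictionary for malformed input lets the existing "missing required fields" check produce the 400 response without extra branching.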

3. Background Processing with SQS and Lambda

  • A second Lambda function (Lambda Function 2) is triggered by the Amazon SQS queue. The Lambda service polls the queue on the function's behalf and invokes it with batches of messages.
  • For each message received, Lambda Function 2:
      1. Reads the request details (session ID, text, and voice choice) from the message body.
      2. Calls Amazon Polly, passing the text and the selected voice.
      3. Receives the synthesized speech from Polly as an MP3 stream and stores it in Amazon S3.
      4. Generates a pre-signed URL for the audio file, valid for one hour.
      5. Writes the pre-signed URL to the corresponding DynamoDB entry.

import json
import logging
import boto3
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create the clients once so warm invocations can reuse them.
dynamodb = boto3.client('dynamodb')
polly_client = boto3.client('polly')
s3_client = boto3.client('s3')

dynamodb_table = os.getenv("DYNAMODB_TABLE")
bucket_name = os.getenv("BUCKET_NAME")

def lambda_handler(event, context):
    # An SQS event can deliver several messages in one invocation.
    for record in event['Records']:
        try:
            # Parse inside the try block so malformed messages are caught below.
            sqs_message = json.loads(record['body'])
            logger.info(f"Received message: {sqs_message}")

            session_id = sqs_message.get("session_id", "No session ID provided")
            text = sqs_message.get("text", "No text provided")
            choice_of_voice = sqs_message.get("choice_of_voice", "No voice gender provided")

            logger.info(f"Extracted text: {text}")
            logger.info(f"Extracted voice gender: {choice_of_voice}")

            # VoiceId must be a valid Polly voice name such as 'Joanna' or 'Matthew'.
            response_audio = polly_client.synthesize_speech(VoiceId=choice_of_voice, OutputFormat="mp3", Text=text)

            audio_stream = response_audio.get("AudioStream")
            if audio_stream:
                # Store the MP3 under the session ID, then create a one-hour download link.
                object_key = f"{session_id}.mp3"
                s3_client.put_object(Bucket=bucket_name, Key=object_key, Body=audio_stream.read())
                logger.info(f"Audio stream saved to S3 bucket: {bucket_name}, Key: {object_key}")
                pre_signed_url = s3_client.generate_presigned_url('get_object', Params={'Bucket': bucket_name, 'Key': object_key}, ExpiresIn=3600)

                try:
                    # Attach the download URL to the item created by Lambda Function 1.
                    response_update_db = dynamodb.update_item(
                        TableName=dynamodb_table,
                        Key={'session_id': {'S': session_id}},
                        UpdateExpression='SET download_url = :pre_signed_url',
                        ExpressionAttributeValues={':pre_signed_url': {'S': pre_signed_url}}
                    )
                    logger.info(f"Update succeeded: {response_update_db}")
                except Exception as e:
                    # Log and move on so the remaining messages are still processed.
                    logger.error(f"Update failed: {e}")
                    continue
            else:
                logger.error("Failed to get audio stream from Polly response")

        except json.JSONDecodeError:
            logger.error("Failed to parse SQS message as JSON")
            continue

    return {
        'statusCode': 200,
        'body': "Conversion Successful"
    }
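
Polly's `VoiceId` parameter expects a voice name rather than a gender, so if the frontend sends "male" or "female", the value needs to be translated before calling `synthesize_speech`. The sketch below shows one possible mapping. The choice of the US English voices Matthew and Joanna, and the helper name `resolve_voice_id`, are assumptions for illustration; the original project may handle this differently.

```python
# Assumed mapping from the frontend's gender choice to Polly voice names.
VOICE_BY_GENDER = {"male": "Matthew", "female": "Joanna"}


def resolve_voice_id(choice_of_voice, default="Joanna"):
    """Translate a gender choice (or an already valid name) into a VoiceId."""
    value = str(choice_of_voice).strip()
    if value.lower() in VOICE_BY_GENDER:
        return VOICE_BY_GENDER[value.lower()]
    if value in VOICE_BY_GENDER.values():
        # The caller already supplied one of the known voice names.
        return value
    # Anything unrecognized falls back to the default voice.
    return default


voice_id = resolve_voice_id("Male")
```

With this in place, the Polly call would receive `VoiceId=resolve_voice_id(choice_of_voice)` instead of the raw value.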

4. Retrieving the Pre-Signed URL

  • After queuing the message, Lambda Function 1 queries DynamoDB to check whether the pre-signed URL is available.
  • If the URL is present, it is returned through API Gateway to the frontend. If not, the function waits 5 seconds and checks again, up to a maximum of 5 attempts (about 25 seconds in total). This stays within API Gateway's default 29-second integration timeout, but the Lambda function's own timeout must be raised above its 3-second default.
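
The retry logic can be isolated into a small helper, which also makes it testable without AWS. This is a minimal sketch under assumed names (`poll_for_value`, `fetch`); unlike the handler shown earlier, it skips the sleep after the final attempt, since nothing is checked afterwards.

```python
import time


def poll_for_value(fetch, max_retries=5, wait_time=5, sleep=time.sleep):
    """Call fetch() until it returns a truthy value or attempts run out."""
    for attempt in range(1, max_retries + 1):
        value = fetch()
        if value:
            return value
        if attempt < max_retries:
            # Wait only when another attempt will follow.
            sleep(wait_time)
    return None


# Example with a stand-in fetch function that succeeds on the third call.
attempts = {"count": 0}

def fake_fetch():
    attempts["count"] += 1
    return "https://example.com/audio.mp3" if attempts["count"] == 3 else None

url = poll_for_value(fake_fetch, wait_time=0)
```

In the Lambda function, `fetch` would wrap the `table.get_item` call and return the item's `download_url` attribute, or `None` when it is not yet present.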

5. Frontend Download Button

  • The frontend receives the response containing the pre-signed URL.
  • It provides a download button for the user to retrieve the converted audio file.
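
The actual frontend is most likely written in JavaScript, but the response handling is simple enough to sketch in Python. The helper name `extract_download_url` is hypothetical; the `statusCode`, `body`, and `download_url` fields follow the response returned by Lambda Function 1.

```python
import json


def extract_download_url(response):
    """Return the pre-signed URL from a successful response, else None."""
    if response.get("statusCode") != 200:
        return None
    body = response.get("body", "{}")
    if isinstance(body, str):
        # Lambda Function 1 serializes the body with json.dumps.
        body = json.loads(body)
    return body.get("download_url")


sample = {
    "statusCode": 200,
    "body": json.dumps({
        "session_id": "abc-123",
        "download_url": "https://example.com/abc-123.mp3",
    }),
}
download_url = extract_download_url(sample)
```

A frontend would enable the download button only when this value is present and show the error message from the body otherwise.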

AWS Services Used

1. Amazon API Gateway

Acts as the interface between the frontend and backend, routing requests to AWS Lambda.

2. AWS Lambda

  • Lambda Function 1: Processes initial requests, stores data in DynamoDB, and sends messages to SQS.
  • Lambda Function 2: Consumes messages from the SQS queue, calls Polly, stores the audio in S3, and updates DynamoDB.
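
Because Lambda Function 2 receives messages in batches, one failing message can otherwise cause the whole batch to be retried or silently dropped. Lambda supports partial batch responses for SQS: the function returns the IDs of the failed messages under `batchItemFailures`, and only those return to the queue. The sketch below assumes `ReportBatchItemFailures` is enabled on the event source mapping; `process_batch` and `handle_message` are hypothetical names.

```python
import json


def process_batch(event, handle_message):
    """Process each SQS record and report only the ones that failed."""
    failures = []
    for record in event.get("Records", []):
        try:
            handle_message(json.loads(record["body"]))
        except Exception:
            # Only this message is returned to the queue for a retry.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


# Example with a stand-in handler that rejects messages without text.
def demo_handler(message):
    if not message.get("text"):
        raise ValueError("missing text")

demo_event = {
    "Records": [
        {"messageId": "1", "body": json.dumps({"text": "Hello"})},
        {"messageId": "2", "body": json.dumps({"text": ""})},
    ]
}
result = process_batch(demo_event, demo_handler)
```

In a real function, `handle_message` would contain the Polly, S3, and DynamoDB steps for a single request.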

3. Amazon DynamoDB

Stores request details and pre-signed URLs for retrieval.

4. Amazon SQS

Decouples processing by enabling asynchronous execution of text-to-speech conversion.

5. Amazon Polly

Converts text into natural-sounding speech.

6. Amazon S3

Stores the generated audio files and provides a secure way to access them via pre-signed URLs.

7. AWS Amplify

Hosts the frontend UI that allows users to submit text and retrieve the generated audio file.

Advantages of This Architecture

  • Scalability: AWS Lambda and SQS allow for the efficient handling of multiple requests simultaneously.
  • Cost-Effectiveness: The serverless model ensures you only pay for actual compute usage.
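  • Asynchronous Processing: SQS decouples request intake from speech synthesis, so conversion work can be buffered, retried, and scaled independently of the API.
  • Security: Pre-signed URLs grant time-limited access (one hour in this implementation) to the generated audio files without making the S3 bucket public.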

Future Improvements

  • Implement an API Gateway WebSocket API to push the download URL to the client as soon as it is ready, instead of polling DynamoDB with a delay.
  • Enable user authentication to track request history and usage.
  • Optimize DynamoDB queries for better performance at scale.
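
For the WebSocket idea, Lambda Function 2 could push the download URL to the client as soon as it is written. The sketch below uses the `post_to_connection` call of boto3's `apigatewaymanagementapi` client. It assumes the client's WebSocket connection ID was stored alongside the request, which the current design does not do; the helper name `push_download_url` is hypothetical, and the client is passed in so the function can be exercised without AWS.

```python
import json


def push_download_url(client, connection_id, session_id, download_url):
    """Send the finished download URL to one WebSocket connection."""
    message = {"session_id": session_id, "download_url": download_url}
    client.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps(message).encode("utf-8"),
    )
    return message


# Stand-in client used here so the sketch runs without AWS credentials.
class FakeClient:
    def __init__(self):
        self.sent = []

    def post_to_connection(self, ConnectionId, Data):
        self.sent.append((ConnectionId, Data))


fake = FakeClient()
push_download_url(fake, "conn-1", "abc-123", "https://example.com/abc-123.mp3")
```

In a deployed function, the client would be created with `boto3.client("apigatewaymanagementapi", endpoint_url=...)`, where the endpoint is the WebSocket API's connection management URL.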

Conclusion

This AWS-powered serverless TTS application demonstrates a robust, scalable, and cost-efficient solution for text-to-speech conversion. By leveraging cloud-native services, we ensure a seamless experience for users while maintaining an efficient backend workflow.

Written by Charith Herath

BSc (Hons) Electrical and Electronic Engineering | CCNP | CCNA | Cloud Enthusiast
