Building a Serverless Text-to-Speech Application Using AWS Lambda, API Gateway, and Amazon Polly

Feb 27, 2025

Introduction

In this article, we will explore the architecture and workflow of a serverless text-to-speech (TTS) conversion application built on AWS. This solution leverages AWS Lambda, API Gateway, Amazon Polly, Amazon S3, DynamoDB, and Amazon SQS to provide an asynchronous and scalable text-to-speech conversion service.

Architecture Diagram

Application Workflow

1. User Input via Frontend

The frontend, hosted on AWS Amplify, gives users a form to enter text and select the preferred voice gender (male or female). When the user submits the form, the frontend sends the text and voice choice to Amazon API Gateway in an HTTP POST request.
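
As a rough illustration of that request, the sketch below builds an equivalent POST in Python using only the standard library. The endpoint URL is a placeholder, the helper name `build_tts_request` is hypothetical, and the field names `text` and `choice_of_voice` are the ones read by the Lambda handler in the next step.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the invoke URL of your API Gateway stage.
API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/convert"


def build_tts_request(text, choice_of_voice, url=API_URL):
    """Build (but do not send) the POST request the frontend would issue."""
    payload = json.dumps({"text": text, "choice_of_voice": choice_of_voice})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. A generous client timeout is advisable, since the backend may hold the request open while it waits for the audio.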

2. Processing Request with Lambda and DynamoDB

API Gateway forwards the request to Lambda Function 1.
Lambda Function 1 generates a session ID and stores the request details in Amazon DynamoDB.
The function then sends the request to an Amazon SQS queue for asynchronous processing.

import json
import boto3
import uuid
import os
import logging
import time

logger = logging.getLogger()
logger.setLevel(logging.INFO)

dynamodb_table = os.getenv('DYNAMODB_TABLE')
queue_url = os.getenv('SQS_QUEUE_URL')

dynamodb = boto3.resource('dynamodb')
sqs = boto3.client('sqs')

max_retries = 5
wait_time = 5

def lambda_handler(event, context):
    table = dynamodb.Table(dynamodb_table)


    text = event.get('text', '')
    choice_of_voice = event.get('choice_of_voice', '')
    session_id = event.get('session_id')

    if not session_id:
        session_id = str(uuid.uuid4())    

    if not text or not choice_of_voice:
        return {
            'statusCode': 400,
            'body': json.dumps({'error': 'Missing required fields: text or choice_of_voice'})
        }    


    table.put_item(
        Item={
            'session_id': session_id,
            'text': text,
            'choice_of_voice': choice_of_voice
        }
    )

    
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({
            'session_id': session_id,
            'text': text,
            'choice_of_voice': choice_of_voice
        })
    )


    for i in range(max_retries):
        db_response = table.get_item(
            Key={
                'session_id': session_id
            }
        )

        if 'Item' in db_response:
            item = db_response['Item']
            if item.get('download_url'):
                logger.info(json.dumps({
                    'message': "Thank you for your request. Your text is converted into voice",
                    'session_id': session_id,
                    'download_url': item['download_url']
                }))
                return {
                    'statusCode': 200,
                    'body': json.dumps({
                        'message': "Thank you for your request. Your text is converted into voice",
                        'session_id': session_id,
                        'download_url': item['download_url']
                    })
                }
        
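        # Skip the wait after the final attempt so the error response is not delayed
        if i < max_retries - 1:
            logger.info(f"Retry {i+1}/{max_retries}: Download URL not available. Retrying in {wait_time} seconds")
            time.sleep(wait_time)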
    
    error_message = {
        'message': "Text-to-speech conversion failed or took too long. Please try again later.",
        'session_id': session_id
    }

    logger.error(json.dumps(error_message))

    return {
        'statusCode': 500,
        'body': json.dumps(error_message)
    }
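
The handler above reads `text` and `choice_of_voice` directly from `event`, which works when API Gateway uses a non-proxy integration that passes the JSON body through as the event. With a Lambda proxy integration, the payload instead arrives as a JSON string under `event['body']`. The following is a minimal sketch of a helper that accepts both shapes, assuming the field names used above; `extract_payload` is a hypothetical name.

```python
import base64
import json


def extract_payload(event):
    """Return the request fields whether or not proxy integration is used."""
    body = event.get("body")
    if body is None:
        # Non-proxy integration: the fields sit at the top level of the event.
        return event
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    if isinstance(body, str):
        try:
            body = json.loads(body)
        except json.JSONDecodeError:
            return {}  # Malformed JSON: the caller's validation can return a 400.
    # Anything other than a JSON object is treated as an empty payload.
    return body if isinstance(body, dict) else {}
```

The handler could then call `payload = extract_payload(event)` and read `payload.get('text', '')` instead of `event.get('text', '')`.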

3. Background Processing with SQS and Lambda

A second Lambda function (Lambda Function 2) is triggered by the Amazon SQS queue through an event source mapping: the Lambda service polls the queue and invokes the function with a batch of messages.
For each message in the batch, Lambda Function 2:

Retrieves the request details (session ID, text, and voice gender) from the message body.
Calls Amazon Polly, passing the text and the selected voice.
Receives the synthesized speech from Polly as an MP3 stream and uploads it to Amazon S3.
Generates a pre-signed URL for the audio file.
Writes the pre-signed URL to the corresponding DynamoDB item.

import json
import logging
import boto3
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

dynamodb = boto3.client('dynamodb')
polly_client = boto3.client('polly')

dynamodb_table = os.getenv("DYNAMODB_TABLE")

def lambda_handler(event, context):

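    for record in event['Records']:
        try:
            # Parse inside the try block so malformed JSON is caught by the handler below
            sqs_message = json.loads(record['body'])
            logger.info(f"Received message: {sqs_message}")
            session_id = sqs_message.get("session_id")
            text = sqs_message.get("text")
            choice_of_voice = sqs_message.get("choice_of_voice")

            # Skip incomplete messages instead of passing placeholder strings to Polly
            if not (session_id and text and choice_of_voice):
                logger.error("Message is missing session_id, text, or choice_of_voice")
                continue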
                
            logger.info(f"Extracted text: {text}")
            logger.info(f"Extracted voice gender: {choice_of_voice}")

            response_audio = polly_client.synthesize_speech(VoiceId=choice_of_voice, OutputFormat="mp3", Text=text)

            audio_stream = response_audio.get("AudioStream")
            if audio_stream:
                s3_client = boto3.client('s3')
                bucket_name = os.getenv("BUCKET_NAME")
                object_key = f"{session_id}.mp3"
                # Upload the MP3 bytes returned by Polly
                s3_client.put_object(
                    Bucket=bucket_name,
                    Key=object_key,
                    Body=audio_stream.read(),
                    ContentType="audio/mpeg"
                )
                logger.info(f"Audio stream saved to S3 bucket: {bucket_name}, Key: {object_key}")
                # Generate a download link that expires after one hour
                pre_signed_url = s3_client.generate_presigned_url(
                    'get_object',
                    Params={'Bucket': bucket_name, 'Key': object_key},
                    ExpiresIn=3600
                )

                try:
                    response_update_db = dynamodb.update_item(
                        TableName=dynamodb_table,
                        Key={'session_id': {'S': session_id}},
                        UpdateExpression='SET download_url = :pre_signed_url',
                        ExpressionAttributeValues={':pre_signed_url': {'S': pre_signed_url}}
                    )
                    logger.info(f"Update succeeded: {response_update_db}")
                except Exception:
                    # Log with a traceback and move on to the next record; returning
                    # here would silently skip the rest of the batch
                    logger.exception("Update failed")
                    continue
                
            
            else:
                logger.error("Failed to get audio stream from Polly response")

            

        except json.JSONDecodeError:
            logger.error("Failed to parse SQS message as JSON")
            continue

    return {
        'statusCode': 200,
        'body': "Conversion Successful"
    }
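
One detail to watch: Polly's `synthesize_speech` expects `VoiceId` to be the name of a voice, not a gender. If the frontend sends `male` or `female`, the value needs to be translated before the call. The following is a minimal sketch, assuming the US English voices Matthew and Joanna as defaults; the mapping and the `resolve_voice_id` name are illustrative choices, not part of the original application.

```python
# Assumed mapping from the frontend's gender choice to a Polly voice name.
VOICE_BY_GENDER = {"male": "Matthew", "female": "Joanna"}


def resolve_voice_id(choice_of_voice):
    """Translate a gender label into a Polly VoiceId; pass voice names through."""
    if not choice_of_voice or not choice_of_voice.strip():
        raise ValueError("choice_of_voice is required")
    cleaned = choice_of_voice.strip()
    return VOICE_BY_GENDER.get(cleaned.lower(), cleaned)
```

With this in place, the Polly call would use `VoiceId=resolve_voice_id(choice_of_voice)`.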

4. Retrieving the Pre-Signed URL

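After queuing the message, Lambda Function 1 polls DynamoDB for the pre-signed URL. It checks up to 5 times, waiting 5 seconds between attempts.
If the URL is present, it is included in the response and sent back through API Gateway to the frontend. If it never appears, the function returns an error. Because the function can spend about 25 seconds waiting, its timeout must be raised well above Lambda's 3-second default, and the total must stay within API Gateway's integration timeout (29 seconds by default).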
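
The retry logic can be factored into a small helper that takes the lookup and the sleep function as parameters, which makes it testable without DynamoDB. This is only a sketch: `poll_for_url` and its arguments are hypothetical names, with defaults mirroring `max_retries` and `wait_time` from Lambda Function 1.

```python
import time


def poll_for_url(fetch_item, max_retries=5, wait_time=5, sleep=time.sleep):
    """Call fetch_item() until it returns an item containing a download_url.

    fetch_item is any zero-argument callable returning a dict or None, for
    example a wrapper around table.get_item(...).get('Item').
    Returns the URL, or None if it never appears.
    """
    for attempt in range(max_retries):
        item = fetch_item() or {}
        url = item.get("download_url")
        if url:
            return url
        if attempt < max_retries - 1:
            sleep(wait_time)  # No need to wait after the final attempt.
    return None
```

Injecting `sleep` lets a unit test record the waits instead of actually pausing.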

5. Frontend Download Button

The frontend receives the response containing the pre-signed URL.
It provides a download button for the user to retrieve the converted audio file.

AWS Services Used

1. Amazon API Gateway

Acts as the interface between the frontend and backend, routing requests to AWS Lambda.

2. AWS Lambda

Lambda Function 1: Processes initial requests, stores data in DynamoDB, and sends messages to SQS.
Lambda Function 2: Consumes messages from the SQS queue, calls Polly, uploads the audio to S3, and updates DynamoDB.

3. Amazon DynamoDB

Stores request details and pre-signed URLs for retrieval.

4. Amazon SQS

Decouples processing by enabling asynchronous execution of text-to-speech conversion.
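
Because Lambda receives SQS messages in batches, one bad message should not cause the whole batch to be retried or dropped. If `ReportBatchItemFailures` is enabled on the event source mapping, the handler can return the IDs of only the failed messages. The following is a minimal sketch of that response shape, where `process_record` is a hypothetical stand-in for the Polly and S3 logic.

```python
import json


def process_record(message):
    """Hypothetical stand-in for the synthesis logic; raises on bad input."""
    if not message.get("text"):
        raise ValueError("text is required")


def lambda_handler(event, context):
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Report only this message so SQS redelivers it on its own.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list tells Lambda that every message in the batch succeeded.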

5. Amazon Polly

Converts text into natural-sounding speech.

6. Amazon S3

Stores the generated audio files and provides a secure way to access them via pre-signed URLs.

7. AWS Amplify

Hosts the frontend UI that allows users to submit text and retrieve the generated audio file.

Advantages of This Architecture

Scalability: AWS Lambda and SQS allow for the efficient handling of multiple requests simultaneously.
Cost-Effectiveness: The serverless model ensures you only pay for actual compute usage.
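Asynchronous Processing: SQS decouples speech synthesis from request intake, so bursts of requests queue up instead of overloading the backend. Note that in the current design Lambda Function 1 still holds the HTTP request open while it polls for the result.
Security: Pre-signed URLs grant time-limited access (one hour in this implementation) to the generated audio files without making the S3 bucket public.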

Future Improvements

Implement a WebSocket API for real-time updates instead of polling DynamoDB with a delay.
Enable user authentication to track request history and usage.
Optimize DynamoDB queries for better performance at scale.
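
As a sketch of the first improvement, Lambda Function 2 could push the URL to the browser over an API Gateway WebSocket API instead of having Lambda Function 1 poll. The example assumes the client's WebSocket connection ID was stored alongside the session, and that `api_client` is a `boto3.client('apigatewaymanagementapi', endpoint_url=...)` pointed at the WebSocket stage. `notify_client` is a hypothetical helper.

```python
import json


def notify_client(api_client, connection_id, session_id, download_url):
    """Send the finished download URL to one connected WebSocket client."""
    message = {"session_id": session_id, "download_url": download_url}
    # post_to_connection delivers the payload to the given connection.
    api_client.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps(message).encode("utf-8"),
    )
    return message
```

Taking the client as a parameter keeps the function easy to test with a stub and avoids creating a new client on every call.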

Conclusion

This AWS-powered serverless TTS application demonstrates a robust, scalable, and cost-efficient solution for text-to-speech conversion. By leveraging cloud-native services, we ensure a seamless experience for users while maintaining an efficient backend workflow.