Automatic Dubbing System: How to Create a Text-to-Speech System

Guide to creating text-to-speech (TTS) functionality for an automatic dubbing system.
By Boris Delovski • Updated on Feb 24, 2025

Creating a synthetic voice to read text may seem complicated, but it is quite straightforward. While building such a model from scratch can be complex, we simplify the process by using a pre-trained model. This reduces the task to merely selecting a model that produces a voice we like. Once we have chosen the desired voice, it can be used to narrate the translation generated in the previous step of our automatic dubbing pipeline. That is essentially the core of the process. However, personal preference is not the only factor to consider when choosing a voice. Other important criteria must also be taken into account, which will be discussed further in this article.

What Is Text-to-Speech?

Creating a synthetic voice that reads text aloud is known as developing a text-to-speech (TTS) model. This technology has evolved rapidly, becoming an essential tool across various industries. Originally intended to enhance accessibility, TTS systems helped individuals with visual impairments interact more easily with digital content. However, as the technology advanced, its applications expanded far beyond accessibility. As a result, TTS is now widely used in entertainment, education, customer service, communication, and many other fields.

Modern TTS systems utilize neural networks to generate synthetic voices that closely mimic human speech, capturing nuances like intonation, pitch, and even emotion. Nowadays, users can select voices tailored to specific contexts, accents, or speaking styles. This provides TTS systems with a level of flexibility that did not exist before.

With further advancements in artificial intelligence, TTS systems are expected to become even more refined, producing context-aware and emotionally expressive speech. This will further blur the line between human and machine-generated voices. The future of TTS holds the promise of synthetic speech that sounds indistinguishable from human conversation.


What Are the Most Popular Paid TTS Tools?

There is an abundance of popular TTS tools available today. In this series of articles, I will focus on using a model from Hugging Face to generate synthetic voices. However, it would be remiss not to mention the most popular paid tools currently on the market.

Any tool from this list can easily be integrated into the automatic dubbing pipeline and, in most cases, may even improve the overall quality. After all, the reason behind the price tag is the advanced capabilities they offer, often outperforming free, open-source models.

The first tool worth highlighting is ElevenLabs' TTS system, which is widely regarded as the top option available. It offers both free and paid versions. This tool stands out for producing exceptionally high-quality and realistic voices. A key feature is its ability to customize tone and emotional expression. In addition, it supports a great number of languages, including those spoken by smaller communities.

ElevenLabs is primarily used for applications like audiobooks, podcasts, and advertisements. Despite its premium performance, ElevenLabs remains reasonably priced. The platform offers free plans for users who want to explore its features before committing to a paid option. Considering its powerful capabilities and affordability, it is no surprise that ElevenLabs is a leading choice in today's TTS market.

Murf AI is another tool worth mentioning. Murf AI is a versatile text-to-speech tool that is highly regarded for its customization capabilities. It offers a wide range of voices, which can be tailored in terms of pitch, speed, and tone to create truly dynamic audio output. One of its remarkable features is the ability to sync voiceovers with video, making it particularly useful for multimedia projects. As a result, Murf is often used in professional contexts, such as creating voiceovers for podcasts, e-learning courses, and advertisements. 

Despite its advanced features, Murf remains accessible with a free plan for casual users, and paid plans starting at $23/month. This makes it an affordable option for those needing high-quality TTS solutions. While Murf AI is undoubtedly a powerful tool, I still believe that ElevenLabs outperforms it.

Finally, we must mention Play HT, a TTS platform that excels in lifelike audio generation, capturing human intonation and emotion remarkably well. Play HT offers control over pitch, speed, emphasis, and pauses, making it a robust option for those seeking highly customizable audio. While its customization features may be slightly less extensive than those of competitors like Murf, the platform's intuitive interface makes it ideal for beginners and professionals alike.

Play HT supports a variety of voices and accents, making it suitable for global applications. Pricing starts at $14.25/month, with various options depending on usage needs. 

Before moving on to the next section of this article, I have a few honorable mentions: the TTS tools offered by the three largest cloud platforms:

  • Amazon Web Services: Offers Amazon Polly, a robust text-to-speech service.
  • Microsoft Azure: Features Azure AI Speech, a comprehensive tool for speech-to-text, translation, and text-to-speech.
  • Google Cloud: Provides Google Cloud Text-to-Speech, a reliable paid service that can be integrated into existing systems.

While these services might not directly compete with specialized TTS providers, they still deliver excellent quality. If you plan to build an automated dubbing system on one of these cloud platforms, they make a great choice. The slight reduction in voice quality is often outweighed by the benefits of keeping everything under one provider. This ensures easy integration and a smooth, intuitive workflow.
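To give a sense of what working with one of these cloud services looks like, here is a minimal sketch using Amazon Polly through the boto3 library. It assumes your AWS credentials are already configured, and the voice ID and output format are only examples; check the Polly documentation for the voices available to you.

import boto3

# Create a Polly client (assumes AWS credentials are already configured)
polly = boto3.client("polly")

# Request speech synthesis; the voice ID here is only an example French voice
response = polly.synthesize_speech(
    Text="Ceci est un tutoriel sur la création d'un système de doublage automatique",
    OutputFormat="mp3",
    VoiceId="Celine",
)

# Polly returns the audio as a stream that we can write to disk
with open("dubbed_sentence_polly.mp3", "wb") as f:
    f.write(response["AudioStream"].read())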

What Are the Most Popular Free TTS Tools?

While paid tools generally outperform free ones, this does not mean that free options are subpar. In fact, some free solutions deliver performance comparable to that of paid tools, allowing users to avoid subscription costs altogether. However, it is important to clarify that when we refer to "free tools," it is often more accurate to label them as models rather than fully developed tools.

So, what is the difference between a model and a tool?

A tool typically refers to a complete platform or service that provides a user-friendly interface, integration options, and additional features like real-time processing, customization, and cloud support. These tools are designed to provide a seamless user experience, as demonstrated by ElevenLabs, Murf, or Speechify, which allow users to easily generate high-quality voices with minimal effort.

In contrast, a model is generally a pre-trained neural network capable of generating text-to-speech output, but it requires more technical expertise to implement. Models like Mozilla's TTS, Coqui TTS, and others are open-source and free to use. However, setting them up requires programming knowledge and model deployment skills.
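To illustrate the difference, working with a free model usually means writing a short script rather than clicking through a web interface. The sketch below shows roughly what this looks like with Coqui TTS; it assumes the Coqui TTS package is installed, and the model name is only one of its example English checkpoints, which may vary between releases.

from TTS.api import TTS

# Load a pre-trained Coqui TTS model (the model name is only an example)
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Generate speech and write it directly to a WAV file
tts.tts_to_file(
    text="This is a tutorial on building an automatic dubbing system.",
    file_path="coqui_example.wav",
)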

We can also categorize "free" models into two types: those that allow commercial use and those that do not. Depending on the licensing terms, some models permit commercial use of the generated content, while others restrict usage to research or similar non-commercial purposes. The highest-quality models currently available on Hugging Face, and the ones I will use in this article, are those created as part of the MMS project.

Massively Multilingual Speech (MMS) Project Models

The Massively Multilingual Speech (MMS) project, developed by Meta AI, aims to expand speech technology to support over 1,000 languages. This initiative addresses the limitations of current speech systems, which cover only about 100 languages. The models created in this project cover 1,107 languages for speech recognition and synthesis, and 4,017 languages for language identification.

The results show significant performance improvements over existing models like Whisper. While Whisper performs exceptionally well in English, its performance declines for other languages. As part of the MMS project, TTS models were developed for a wide range of languages. These models are extremely high quality, but they are not available for commercial use.

To be more precise, the TTS models are released under the Creative Commons Attribution-NonCommercial 4.0 license. This means we cannot use them in any system intended for commercial purposes. Therefore, while I will demonstrate how to use this model to generate audio, you will not be able to use it commercially. However, that should not pose a significant problem. Because I will use the model uploaded to Hugging Face, you can easily swap in any other TTS model from the Hugging Face hub when creating the class that serves as the TTS component of our pipeline.

The code we will use to create a class that serves as the TTS component of our pipeline, generating a synthetic voice to read aloud the text we translated earlier, looks like this:

import torch
from transformers import VitsModel, AutoTokenizer
import soundfile as sf


class TextToSpeech:
    def __init__(self, model_name="facebook/mms-tts-fra"):
        """
        Initialize the TextToSpeech class with the given model and tokenizer.

        Args:
            model_name (str): The name of the model to load from the Hugging Face hub.
        """
        # Load the TTS model and tokenizer
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = VitsModel.from_pretrained(model_name).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def synthesize_speech(self, text, output_path):
        """
        Synthesize speech from text and save it to a file.

        Args:
            text (str): The text to convert into speech.
            output_path (str): The file path to save the synthesized audio.
        """
        # Tokenize the input text
        inputs = self.tokenizer(text, return_tensors="pt").to(self.device)

        # Generate speech (waveform) from the input text
        with torch.no_grad():
            output = self.model(**inputs).waveform

        # Save the synthesized audio to the output path
        sf.write(output_path, output.squeeze().cpu().numpy(), self.model.config.sampling_rate)

The code above uses three libraries:

  • PyTorch
  • transformers
  • soundfile

We structure our code as a class to enhance reusability and modularity, ensuring that it can easily integrate into our automatic dubbing pipeline. The TextToSpeech class accepts an optional argument model_name, which is the name of the pre-trained TTS model from the Hugging Face model hub. 

class TextToSpeech:
    def __init__(self, model_name="facebook/mms-tts-fra"):

By default, the model name is set to "facebook/mms-tts-fra". This TTS model from the MMS project is specifically designed to generate a synthetic French voice that can articulate any text you provide.
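Since the model name is just a constructor argument, switching the voice to another language only requires passing a different MMS checkpoint. The snippet below is a small illustration; it assumes that checkpoints for other languages follow the same facebook/mms-tts-<language code> naming pattern on the Hugging Face hub.

# French voice (the default)
tts_french = TextToSpeech()

# Other languages, assuming the facebook/mms-tts-<language code> naming pattern
tts_english = TextToSpeech(model_name="facebook/mms-tts-eng")
tts_german = TextToSpeech(model_name="facebook/mms-tts-deu")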

In the __init__ method of the class, we first set the device on which we will run the model. 

        # Set the device, then load the TTS model and tokenizer
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = VitsModel.from_pretrained(model_name).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name) 

If the user has a GPU, the model will default to using it. Otherwise, it will run on the CPU. To load the model, we use the from_pretrained method of the VitsModel class that we imported earlier. This method loads the MMS TTS model directly from the Hugging Face model hub, so that we can use it later when we define the synthesize_speech method.

At the same time, we also load the tokenizer, which converts our text into tokens that the model can process as input. For this purpose, we use the AutoTokenizer class from the transformers library, specifically its from_pretrained method. This automatically loads the appropriate tokenizer for a particular model from the model hub, ensuring we select the correct one without the risk of making a mistake.
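To make the tokenizer's role a bit more concrete, the short check below shows what it returns for a small piece of text: a dictionary-like object holding the token IDs (and an attention mask) as PyTorch tensors, which is exactly the input format the model expects. The example sentence is arbitrary.

from transformers import AutoTokenizer

# Load the tokenizer that matches the MMS French TTS checkpoint
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-fra")

# Tokenize a short sentence into PyTorch tensors
inputs = tokenizer("Bonjour tout le monde", return_tensors="pt")

# The result behaves like a dictionary of tensors
print(inputs.keys())
print(inputs["input_ids"].shape)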

The final method we need to cover is the synthesize_speech method. This method uses the model that we loaded earlier to generate synthetic speech and subsequently saves the generated sound as a file on our PC. 

The speech synthesis process involves first tokenizing the input text into a tensor to define the inputs, followed by passing these tokenized inputs to the model. 

# Tokenize the input text
inputs = self.tokenizer(text, return_tensors="pt").to(self.device)

# Generate speech (waveform) from the input text
with torch.no_grad():
    output = self.model(**inputs).waveform

To obtain the waveform we are interested in, we access the .waveform attribute of the model's output. Once we have this waveform, we can save the generated audio to our PC, usually as a WAV file. To save it, we use the soundfile library, specifically its write function.

# Save the synthesized audio to the output path
sf.write(
    output_path,
    output.squeeze().cpu().numpy(),
    self.model.config.sampling_rate)

Three things happen here:

  • output.squeeze(): Removes any extra dimensions from the waveform tensor, ensuring the waveform has the correct shape before saving (see the short shape check after this list).
  • .cpu().numpy(): Moves the waveform tensor to the CPU and converts it to a NumPy array, because sf.write expects a NumPy array, not a PyTorch tensor.
  • self.model.config.sampling_rate: Uses the sampling rate from the model's configuration to save the waveform correctly, which is important for playback quality.
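As a quick illustration of the first point, the waveform returned by the model carries a batch dimension, which squeeze removes. The exact sample counts below are only illustrative and depend on the input text.

# Illustrative shape check (the sample count depends on the input text)
print(output.shape)            # e.g. torch.Size([1, 53760]) - batch dimension plus samples
print(output.squeeze().shape)  # e.g. torch.Size([53760]) - 1-D waveform ready for sf.write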

After this detailed breakdown of the class, let's demonstrate how to use it. Last time, we translated a sentence from English to French. After translation, we ended up with the following French sentence:

Ceci est un tutoriel sur la création d'un système de doublage automatique

Now let's use the class we just created to invoke the TTS model and have it generate an audio file that contains a synthetic voice saying this sentence. To do so, I am going to create an instance of the class. Afterward, I am going to use the synthesize_speech method of that newly generated object to create my WAV file.

# Create an instance of the TTS class
tts = TextToSpeech()

# Use the generated object to create a WAV file
tts.synthesize_speech(
    text="Ceci est un tutoriel sur la création d'un système de doublage automatique",
    output_path="dubbed_sentence.wav"
)

By running this code, you will generate a WAV file in the directory from which you run it. This file, called dubbed_sentence.wav, will contain a synthetic voice speaking the translated sentence.
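If you want to double-check the result, you can read the file back with the soundfile library and compute its duration from the number of samples and the sampling rate. This is purely an optional sanity check.

import soundfile as sf

# Read the generated file back and compute its duration
data, sampling_rate = sf.read("dubbed_sentence.wav")
duration_seconds = len(data) / sampling_rate
print(f"Sampling rate: {sampling_rate} Hz, duration: {duration_seconds:.2f} seconds")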

This article explored how to create a synthetic voice using a text-to-speech (TTS) system and integrate it into an automatic dubbing pipeline. By using a pre-trained model, we simplified the process, focusing on selecting the best voice and generating audio that matches the translated text. Tools like the MMS project models or paid options like ElevenLabs help us produce realistic, high-quality voices that bring dubbed content to life. However, we are not finished yet. The final step in our automatic dubbing pipeline is to ensure that the dubbed audio syncs perfectly with the original video. The next article in this series will explore the process of syncing the generated audio with the visual content to create a seamless viewing experience.

Boris Delovski

Data Science Trainer


Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.