A quick breakdown of all the components of an auto-dubbing pipeline was already provided in the previous article of this series. This article focuses on the first two components: preprocessing and transcription. Both components matter a great deal, especially transcription.
As you will see, extracting audio from a video is straightforward and can be done in a few lines of code. Transcribing the audio, however, is more complex and requires a large pre-trained speech-to-text model to achieve the best possible results. In the previous article, we introduced OpenAI's Whisper model, available on Hugging Face, to transcribe our audio. We will deploy it locally to avoid potential costs. The pipeline we build will also allow you to try out other transcription models from Hugging Face, since they all share a similar interface, a feature we will leverage in constructing our pipeline.
How Preprocessing Works
For successful transcription, audio quality is paramount. Clear, high-quality audio increases the chances of producing an accurate transcription. Therefore, the complexity of the preprocessing pipeline depends largely on the recording conditions of the audio we need to dub. In environments with minimal background noise, such as a podcast, there is typically little need to filter out noise before extracting audio from the video. However, this is different in less ideal recording situations. In such cases, additional preprocessing may be required to remove problematic background noises and ensure high transcription quality.
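To give a sense of what such an extra step might look like, below is a minimal sketch of an optional denoising pass. It assumes the third-party noisereduce and SciPy packages and a mono WAV recording; the file names are placeholders, and this step is not part of the pipeline we build in this article.
# Optional denoising step (illustrative only, not part of the pipeline built here)
import noisereduce as nr
from scipy.io import wavfile
# Load the noisy recording: sample rate and raw samples (assumes a mono WAV file)
rate, data = wavfile.read("noisy_audio.wav")
# Estimate the noise profile from the signal itself and suppress it
cleaned = nr.reduce_noise(y=data, sr=rate)
# Save the cleaned audio so the rest of the pipeline works with the denoised file
wavfile.write("cleaned_audio.wav", rate, cleaned)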
As noted in the previous article, we will assume the video was recorded under good conditions. This allows us to focus solely on extracting the audio before transcription. Specifically, the audio will be extracted as a WAV file, a standard for storing uncompressed audio on computers. Using the WAV (Waveform Audio File) format ensures that our transcription model receives the highest quality data, maximizing transcription accuracy.
The Python library MoviePy will be used to extract audio from our video. MoviePy is a popular library that makes video editing tasks efficient and straightforward. Its usefulness is not limited to preprocessing: MoviePy will also come in handy later, when we create the final product by overlaying the dubbed audio onto the original video. However, we will cover that topic in a later article. For now, the focus is on extracting audio from video using MoviePy, which requires only a few lines of code. For example, to extract audio from an MP4 video and save it as a WAV file on our PC, we can simply run the following code:
from moviepy.editor import VideoFileClip
# Define the video location
path_to_original_video = "original_video.mp4"
# Load the MP4 file
video = VideoFileClip(path_to_original_video)
# Extract the audio
original_video_audio = video.audio
# Save the audio as a WAV file
original_audio = "original_audio.wav"
original_video_audio.write_audiofile(original_audio, codec='pcm_s16le')
As can be seen, audio can be extracted from a video and stored as a WAV file in just a few lines of code. Once the audio is extracted, we can proceed to the next step: transcription.
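Before moving on, you can optionally sanity-check the extracted file with Python's built-in wave module. This is a minimal sketch; the file name matches the WAV saved above.
# Quick check of the extracted WAV file using the standard library's wave module
import wave
with wave.open("original_audio.wav", "rb") as wav_file:
    sample_rate = wav_file.getframerate()
    duration = wav_file.getnframes() / sample_rate
    print(f"Sample rate: {sample_rate} Hz, duration: {duration:.1f} seconds")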
The code above, while functional, does not qualify as a pipeline, especially if we later decide to add additional preprocessing steps before the audio extraction. To keep our code reusable, we will build a class called Preprocessor that represents our preprocessing pipeline. This approach gives us flexibility: once we complete the entire auto-dubbing project, we can package all relevant code into a small module or package that users can then use to dub their videos automatically. Our preprocessing class will look like this:
from moviepy.editor import VideoFileClip
import os
class Preprocessor:
    def __init__(self, video_path):
        """
        Initialize the Preprocessor with the path to the original video.

        :param video_path: Path to the original video file (e.g., MP4 file)
        """
        self.video_path = video_path
        self.video = None
        self.audio = None

    def extract_audio(self):
        """
        Convert the MP4 video audio to a WAV format and save it.
        """
        # Set a name for the output WAV file
        base_name = os.path.splitext(self.video_path)[0]
        wav_output_path = f"{base_name}_audio.wav"

        # Use context manager for VideoFileClip
        with VideoFileClip(self.video_path) as video:
            audio = video.audio
            # Use context manager for AudioFileClip
            with audio as audio_clip:
                audio_clip.write_audiofile(wav_output_path)
Structuring code like this is notably better for several reasons:
- Object-Oriented Design
- Resource Management with Context Managers
- Dynamic Output File Naming
First, encapsulating the functionality within a Preprocessor class makes the code more modular and reusable. This object-oriented approach allows for easy extension by adding new methods for future preprocessing tasks. Additionally, object-oriented design makes the code more readable and better documented, and therefore much easier to navigate, especially for collaborators working on the project.
Second, the improved version of the code uses with statements, also known as context managers. For those unfamiliar, a context manager in Python is a construct that helps manage resources by ensuring they are properly acquired and released at the right time. The most common way to use a context manager is through the with statement. In our code, context managers handle the following:
# Opens the video file and ensures it is properly closed after processing,
# even if an error occurs.
with VideoFileClip(self.video_path) as video:
    audio = video.audio

# Manages the audio part of the video,
# ensuring that the audio resources are properly released
# after writing the audio file.
with audio as audio_clip:
    audio_clip.write_audiofile(wav_output_path)
In layman's terms, a context manager acts like a helpful assistant that ensures tasks are started and finished properly, especially when working with resources that require careful attention. As an everyday analogy, imagine it like cooking dinner and using the oven. Before you start, you preheat the oven (setup). After you are done cooking, you turn off the oven (cleanup). Forgetting to turn off the oven could lead to problems, just like not properly managing resources in your code can cause issues.
Similarly, when your code uses resources like files or network connections, it needs to handle both starting and ending them properly. Context managers take care of the setup and cleanup for us, preventing potential issues like memory leaks, locked files, etc.
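To make this concrete, here is a minimal sketch, using a plain file as the resource, of the manual setup and cleanup a with statement handles for you:
# Without a context manager: cleanup must be wrapped in try/finally
# so the file is closed even if an error occurs while reading
f = open("original_audio.wav", "rb")
try:
    header = f.read(44)  # read the 44-byte WAV header
finally:
    f.close()

# With a context manager: the file is closed automatically,
# even if an error occurs inside the block
with open("original_audio.wav", "rb") as f:
    header = f.read(44)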
Finally, the restructured code dynamically generates the output WAV file name based on the input video file by stripping the video file's extension and appending _audio.wav. This makes the code more flexible and reduces the risk of overwriting files or hardcoding file names.
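As a quick illustration of that naming logic (the video file name here is just an example):
import os
# os.path.splitext separates the extension from the rest of the path
base_name = os.path.splitext("holiday_video.mp4")[0]  # "holiday_video"
wav_output_path = f"{base_name}_audio.wav"            # "holiday_video_audio.wav"
print(wav_output_path)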
To use this improved pipeline for extracting audio from video files, we simply create an object of the Preprocessor class after defining it and call its extract_audio() method:
# Define the video location
path_to_original_video = "original_video.mp4"
# Extract the audio from the original video
preprocessing_pipeline = Preprocessor(path_to_original_video)
preprocessing_pipeline.extract_audio()
How Transcription Works
Transcription is arguably the most critical part of our pipeline, as the quality of the final result depends heavily on accurately transcribing the audio. Every step that follows builds upon this transcription. If the transcription is inaccurate, the subsequent translation will also be flawed.
A poor translation means the Text-to-Speech (TTS) model will generate synthetic speech that reads the wrong text in another language. This makes it difficult to synchronize with the original video. Even if synchronization is achieved, the original and dubbed audio will convey entirely different messages. Essentially, poor transcription sets off a domino effect, undermining all the following steps. Therefore, ensuring a high-quality transcription is essential for the success of the entire pipeline.
Among the many models available for transcribing speech, one of the best is the Whisper model created by OpenAI. Despite its strong reputation, many people avoid Whisper because they believe the only way to leverage its capabilities is through OpenAI's API, which means paying OpenAI to transcribe your audio.
However, OpenAI has open-sourced Whisper, so you can run it locally on your machine. Furthermore, the Hugging Face Hub hosts a variety of pre-trained Whisper variants, which increases the chances of finding a version trained on the languages you need to transcribe. In this series of articles, I will demonstrate how to dub videos originally in English.
Dubbing videos in other languages is equally straightforward. It merely involves loading a different variant of the Whisper model, as they all share the same interface. For transcribing English audio, I will use the whisper-large-v3 model, which can be found here:
https://huggingface.co/openai/whisper-large-v3
As you will see when you follow the link, this default model supports more than just English. However, if you want the best possible transcription for a language other than English, I recommend trying one of the variants fine-tuned for that specific language and comparing the results with the default model. There are currently 269 fine-tuned variants available, which you can find by clicking on the models in the "Finetunes" section.
The fine-tuned models sometimes outperform the default version, so they are worth a try, especially since they use the same interface.
To start, let's create a class for our Whisper model:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
# Create class for the Large Whisper transcription model
class WhisperLarge:
    """
    A class for transcribing audio using the Whisper model from OpenAI. This class handles the
    model loading, processing, and transcription of audio files using a pre-trained speech-to-text
    model. The language for transcription can be specified by the user.

    Attributes:
        device (str): The device to run the model on ('cuda:0' if a GPU is available, otherwise 'cpu').
        torch_dtype (torch.dtype): The data type for the model's tensors (float16 if GPU is available, otherwise float32).
        model (AutoModelForSpeechSeq2Seq): The pre-trained Whisper model for speech-to-text tasks.
        processor (AutoProcessor): The processor that handles tokenization and feature extraction for the model.
        pipe (pipeline): A Hugging Face pipeline object for automatic speech recognition with the model.
    """
    def __init__(self, model_name="openai/whisper-large-v3", language="en"):
        """
        Initializes the WhisperLarge class by loading the pre-trained Whisper model and setting up
        the necessary components for transcription. The user can specify the language for transcription.

        Args:
            model_name (str): The name of the pre-trained model to use. Defaults to "openai/whisper-large-v3".
            language (str): The language code for transcription (e.g., 'en' for English, 'es' for Spanish). Defaults to "en".
        """
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
        self.model = AutoModelForSpeechSeq2Seq.from_pretrained(
            model_name, torch_dtype=self.torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
        )
        self.model.to(self.device)
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model=self.model,
            generate_kwargs={"language": language, "task": "transcribe"},
            tokenizer=self.processor.tokenizer,
            feature_extractor=self.processor.feature_extractor,
            torch_dtype=self.torch_dtype,
            device=self.device,
            return_timestamps=True
        )

    def transcribe(self, audio_path):
        """
        Transcribes the audio from the given file path using the Whisper model. The result includes
        the transcribed text as well as timestamps for the spoken segments.

        Args:
            audio_path (str): The file path to the audio file that needs to be transcribed.

        Returns:
            dict: A dictionary containing the transcription result, including:
                - 'text' (str): The transcribed text.
                - 'chunks' (list): A list of timestamped segments indicating when the text was spoken.
        """
        result = self.pipe(audio_path)
        return result
The WhisperLarge class is designed to transcribe audio files using OpenAI's Whisper model. This is done by setting up a speech recognition pipeline using the Hugging Face transformers library. When initialized, the class checks for GPU availability to optimize performance and sets the appropriate data types. It loads the pre-trained Whisper model and processor based on the specified model name and language.
The class creates a speech recognition pipeline configured to transcribe audio and return a dictionary containing the transcribed text along with timestamps for when the words are spoken. A timestamped transcription gives us both the text and the timing of when it was spoken in the video. It looks something like this:
[{'timestamp': (0.96, 6.8), 'text': ' Machine learning is a branch of artificial intelligence and computer science that focuses'}, {'timestamp': (6.8, 12.16), 'text': ' on using data and algorithms to enable artificial intelligence to imitate'}, {'timestamp': (12.16, 15.92), 'text': ' the way that humans learn, gradually improving its accuracy.'}]
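If you want to inspect these segments programmatically, a small sketch like the following works, assuming the list shown above is stored in a variable named chunks:
# Print each transcribed segment together with its start and end time
for chunk in chunks:
    start, end = chunk["timestamp"]
    print(f"[{start:.2f}s - {end:.2f}s] {chunk['text'].strip()}")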
The transcribe method is what provides this transcription: it takes the path to an audio file, runs it through the pipeline, and returns a dictionary containing the transcribed text and the corresponding timestamps. However, instead of using this class directly, let's create another class called Transcriber.
# Create general class for transcription models
class Transcriber:
    def __init__(self, recognizer):
        """Initialize with a speech recognizer instance.

        Args:
            recognizer: A transcription model from Hugging Face that we plan on using.
        """
        self.recognizer = recognizer

    def transcribe(self, audio_path, **kwargs):
        """Transcribe audio using the provided transcription model.

        Args:
            audio_path (str): Path to the audio file.
            **kwargs: Additional arguments to pass to the recognizer's transcribe method.

        Returns:
            Union[str, dict]: The transcription result, which can be:
                - A string containing the transcribed text.
                - A dictionary containing additional details, such as:
                    - 'text' (str): The transcribed text.
                    - 'timestamps' (list, optional): A list of timestamps indicating when certain words were spoken.

        Note:
            The exact return type and contents depend on the specific recognizer used.
            Some models may return only the transcribed text as a string, while others may
            return a dictionary with additional information like timestamps.
        """
        return self.recognizer.transcribe(audio_path, **kwargs)
The Transcriber class is designed to provide a flexible interface for transcribing audio files using any speech recognition model from the Hugging Face transformers library. It achieves this by accepting a recognizer object, such as the previously defined WhisperLarge class or any other model, that implements a transcribe method with a consistent interface. Many models in the transformers library share similar methods and interfaces, particularly for automatic speech recognition tasks. This design allows you to easily switch between different models without changing the core logic of the transcription pipeline.
This class simply delegates the transcription task to the recognizer's transcribe method, making it easy to switch between different models. Keep in mind that not all models return timestamps; some return only text. One of the key advantages of the Whisper model is its ability to return timestamps, which makes it much simpler to synchronize the dubbed audio with the original video. This process will be covered in a future article.
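As a sketch of how another model could be plugged in, the wrapper below uses the Hugging Face checkpoint facebook/wav2vec2-base-960h, which by default returns only text. The class is illustrative and not part of the pipeline built in this series; it just needs to expose the same transcribe method.
from transformers import pipeline

# Illustrative recognizer wrapping a text-only Hugging Face ASR checkpoint
class Wav2Vec2Recognizer:
    def __init__(self, model_name="facebook/wav2vec2-base-960h"):
        self.pipe = pipeline("automatic-speech-recognition", model=model_name)

    def transcribe(self, audio_path, **kwargs):
        # By default this returns a dictionary with just the transcribed text
        return self.pipe(audio_path, **kwargs)

# The Transcriber class accepts it without any changes
transcriber = Transcriber(recognizer=Wav2Vec2Recognizer())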
To transcribe audio using this pipeline, we can run the following code:
# Define path to the WAV file
# extracted from the original video
wav_path = "original_video_audio.wav"
# Initialize the WhisperLarge transcriber
# and transcribe the audio
transcriber = Transcriber(recognizer=WhisperLarge())
transcription = transcriber.transcribe(wav_path)
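Thanks to the shared interface, switching to a fine-tuned Whisper variant only changes the arguments passed to WhisperLarge. The checkpoint name below is a placeholder for whichever fine-tuned model you pick from the Hugging Face Hub:
# Swap in a fine-tuned Whisper variant (placeholder checkpoint name)
# and a different language code, e.g. 'de' for German
transcriber = Transcriber(
    recognizer=WhisperLarge(model_name="your-chosen/whisper-finetune", language="de")
)
transcription = transcriber.transcribe(wav_path)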
The transcription we will get from the WhisperLarge model is a dictionary with two keys:
- text
- chunks
The value stored under the text key is just the transcribed text, which we will use when translating from the original language to the target language. The value stored under the chunks key contains the timestamps, similar to the earlier example; these will be useful for synchronizing the dubbed audio with the original video.
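In code, retrieving both parts of the result looks like this:
# The plain transcript, used later for translation
transcribed_text = transcription["text"]
# The timestamped segments, used later for synchronization
transcribed_chunks = transcription["chunks"]
print(transcribed_text)
print(transcribed_chunks)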
In this article, we explored the critical roles of preprocessing and transcription in building an automatic dubbing system. We started by emphasizing the importance of high-quality audio and demonstrated how to extract audio from video files using the MoviePy library. By structuring our code within a Preprocessor class, we ensured that our preprocessing pipeline is reusable and ready for future enhancements, while context managers keep resource handling safe and clean.
We then examined the complexities of transcription, highlighting its role as the foundation for all subsequent steps in the dubbing process. Utilizing OpenAI's Whisper model via Hugging Face, we established a flexible transcription pipeline that delivers accurate transcriptions along with timestamps. By designing a general Transcriber class, we made it easy to swap out models and accommodate various languages. At the same time, we ensured that the interface remains consistent.
With the audio extracted and accurately transcribed, we are well-prepared to move forward in our auto-dubbing journey. In the next article, the focus will be on the translation component, transforming our transcribed text into the target language while maintaining the integrity of the original message.