In our previous article, we discussed generating a synthetic voice to read aloud the translated version of our original text, effectively creating dubbed audio for the content. However, one crucial step remains to complete the automatic dubbing system: achieving perfect synchronization. The dubbed audio must align seamlessly with the original video to ensure the speech timing matches the visual cues. This alignment is vital for maintaining the natural flow of the video, ensuring that the dubbed audio plays precisely when the person on screen is speaking. Finally, we will merge the synchronized audio with the original video to produce the fully dubbed version.
How Synchronization Works
The complexity of achieving synchronization largely depends on the structure of your system, particularly the translation process. A key challenge lies in managing differences in length between the translated text and the original content.
Ideally, the duration of the translated text should closely match that of the original. It is possible to adjust the speed of the dubbed audio generated by the TTS system, either by compressing or stretching it. Despite this, significant discrepancies in length between the original and translated pronunciations can lead to major issues, most notably audio distortion. To avoid this, our goal is to produce a translation that, when spoken, takes approximately the same amount of time as the original text.
Advanced dubbing systems take synchronization a step further by not only focusing on timing but also matching the phonetic qualities of the translation with the original audio. Phonetics, the study of speech sounds, plays an essential role in aligning the translated speech with the original actor’s lip movements and mouth shapes. This matching also extends to the timing and rhythm of speech, known as prosody. However, achieving this level of precision often requires human intervention. Since our goal is to build a fully automatic system, we will not prioritize this aspect.
In the article on translation, we have already ensured that the translated text closely matches the length of the original. The next step is to adjust the audio output from the TTS system. To achieve this, we need to insert "silence" at the beginning and end of the audio generated by the TTS system. This is essential because the original audio includes not only the spoken sentence but also a few "empty spaces" before and after it.
It is important to note that synchronization becomes much more challenging in cases of significant differences in the spoken duration of the translated text compared to the original. In such cases, you can use libraries like librosa. However, depending on the degree of difference, even that might not be sufficient. As a result, I recommend rephrasing the translation to match the length of the original sentence better. Fortunately, this issue is rare, as most translation algorithms are designed to produce translations that are similar in length to the original text.
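To give a sense of what that looks like, below is a minimal sketch of duration matching with librosa. The file names are hypothetical, and this snippet is not part of the pipeline we are building; it simply stretches or compresses the dubbed speech so it lasts roughly as long as the original:
import librosa
import soundfile as sf

# Load both recordings (hypothetical file names), keeping their native sample rates
original, sr_orig = librosa.load("original_speech.wav", sr=None)
dubbed, sr_dub = librosa.load("dubbed_speech.wav", sr=None)

# A rate above 1 speeds the audio up; below 1 slows it down
rate = (len(dubbed) / sr_dub) / (len(original) / sr_orig)

# Stretch or compress the dubbed speech to match the original duration
stretched = librosa.effects.time_stretch(dubbed, rate=rate)
sf.write("dubbed_speech_stretched.wav", stretched, sr_dub)
The further the rate drifts from 1, the more audible the artifacts become, which is exactly why rephrasing the translation is often the better fix.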
How to Add "Silence"
In the context of this article, I will demonstrate how to add silence to the beginning and end of our TTS-generated speech. To do so, we will create a class designed to keep the process modular and reusable. Before constructing the class, however, we need to import the necessary components. Alongside Python's built-in wave module and NumPy, which the class relies on, we will import the following from the moviepy.editor module:
import wave
import numpy as np
from moviepy.editor import AudioFileClip, concatenate_audioclips, AudioClip
The AudioFileClip class enables us to create an audio clip from an existing audio file. It reads the file without loading the entire content into memory, making it an efficient way to access audio data. The AudioClip class serves as the base class for audio clips. It can be subclassed or used to create custom audio clips by defining a function that generates audio frames. Lastly, the concatenate_audioclips function enables the sequential concatenation of multiple audio clips, allowing them to play one after another in a single continuous audio track.
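Before using these components inside a class, a small standalone sketch may help show how they fit together. The tone parameters and the file name below are made up purely for illustration:
import numpy as np
from moviepy.editor import AudioFileClip, concatenate_audioclips, AudioClip

# A one-second 440 Hz tone defined by a frame-generating function
tone = AudioClip(lambda t: np.sin(2 * np.pi * 440 * t), duration=1, fps=44100)

# An audio clip read from an existing file (hypothetical file name)
speech = AudioFileClip("speech.wav")

# Play the tone, then the speech, as one continuous track
combined = concatenate_audioclips([tone, speech])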
Now, let's create our class.
class NaiveAudioSynchronizer:
def __init__(self, original_audio, dubbed_audio, transcribed_timestamps):
"""
Initialize the NaiveAudioSynchronizer with paths to original and dubbed audio files and timestamps.
Args:
original_audio (str): Path to the original audio file.
dubbed_audio (str): Path to the dubbed audio file.
transcribed_timestamps (list): List of timestamps from the transcribed text, where each element contains
a 'timestamp' key with start and end times.
"""
self.original_audio = original_audio
self.dubbed_audio = dubbed_audio
self.transcribed_timestamps = transcribed_timestamps
def get_wav_length(self, filename):
"""
Calculate and return the length of a WAV audio file in seconds.
Args:
filename (str): Path to the WAV audio file.
Returns:
float: Duration of the audio file in seconds.
"""
with wave.open(filename, 'rb') as wav_file:
num_frames = wav_file.getnframes()
sample_rate = wav_file.getframerate()
duration = num_frames / float(sample_rate)
return duration
def calculate_silence_durations(self):
"""
Calculate the required silence durations for the start and end of the dubbed audio.
Returns:
tuple: Start and end silence durations in seconds.
"""
# Get the length of original and dubbed audio
original_audio_length = self.get_wav_length(self.original_audio)
dubbed_audio_length = self.get_wav_length(self.dubbed_audio)
print(f"Original Audio Length: {original_audio_length} seconds")
print(f"Dubbed Audio Length: {dubbed_audio_length} seconds")
# Silence to add at the beginning
start_silence_length = self.transcribed_timestamps[0]["timestamp"][0]
        # Silence to add at the end, clamped at zero in case the dubbed audio runs longer than the original
        end_silence_length = max(0.0, original_audio_length - (dubbed_audio_length + start_silence_length))
return start_silence_length, end_silence_length
def create_silence(self, duration, sample_rate=44100):
"""
Create a silent audio clip of a given duration (in seconds).
Args:
duration (float): Duration of the silence in seconds.
sample_rate (int): Sampling rate of the audio, default is 44100.
Returns:
AudioClip: A silent audio clip.
"""
        def make_frame(t):
            # Digital silence is just zeros; return one stereo frame per requested time point
            if np.isscalar(t):
                return np.zeros(2)
            return np.zeros((len(t), 2))

        return AudioClip(make_frame, duration=duration, fps=sample_rate)
def add_silence_to_audio(self, output_path, start_silence_duration=0, end_silence_duration=0):
"""
Add silence to the start and end of the dubbed audio and save the result.
Args:
output_path (str): Path to save the final synchronized audio file.
start_silence_duration (float): Duration of silence to add at the beginning in seconds.
end_silence_duration (float): Duration of silence to add at the end in seconds.
"""
# Load the dubbed audio clip
audio_clip = AudioFileClip(self.dubbed_audio)
# Create silent clips for the start and end
start_silence = self.create_silence(start_silence_duration)
end_silence = self.create_silence(end_silence_duration)
# Concatenate start silence, original dubbed audio, and end silence
final_audio = concatenate_audioclips([start_silence, audio_clip, end_silence])
# Write the result to a file
final_audio.write_audiofile(output_path)
print(f"Synched audio saved at: {output_path}")
def synchronize_audio(self, output_path):
"""
Main function to calculate silence durations and synchronize the audio by adding silence.
Args:
output_path (str): Path to save the final synchronized audio file.
"""
# Calculate the start and end silence durations
start_silence_duration, end_silence_duration = self.calculate_silence_durations()
# Add silence and save the synchronized audio
self.add_silence_to_audio(output_path, start_silence_duration, end_silence_duration)
The NaiveAudioSynchronizer class is designed to help synchronize a dubbed audio file with an original audio file by adding silence to the beginning and end of the dubbed audio. Let's take a closer look at the class and its methods.
This class takes the original audio, dubbed audio, and transcribed timestamps of the original audio as inputs. Using this information, it calculates and adds the appropriate silences at the start and end of the dubbed audio. The class includes several key methods:
- __init__(): Constructor that stores the audio paths and transcribed timestamps as instance attributes.
- get_wav_length(): Method to calculate the duration of a WAV audio file in seconds.
- calculate_silence_durations(): Method to compute the amount of silence that needs to be added to the start and end of the dubbed audio.
- create_silence(): Method to generate a silent audio clip of a specified duration.
- add_silence_to_audio(): Method to add silence at the beginning and end of the dubbed audio and save the new audio.
- synchronize_audio(): The main method that orchestrates the calculation of silence durations and synchronization of the dubbed audio.
The __init__() method is the constructor used to initialize the class. It takes three parameters: the path to the original audio file, the path to the dubbed audio file, and a list of timestamps representing the start and end points of the original transcribed sentences. These arguments are stored as instance attributes, making them available to other methods in the class.
The get_wav_length() method calculates and returns the length of a given WAV file based on its path. In essence, it opens the WAV file, retrieves the number of frames and the sample rate, and then calculates the duration of the audio by dividing the number of frames by the sample rate. For example, a file containing 441,000 frames sampled at 44,100 Hz lasts exactly 10 seconds.
The calculate_silence_durations() method computes how much silence needs to be added to the start and end of the dubbed audio. The start silence equals the start time of the first transcribed timestamp, while the end silence is whatever remains once the start silence and the dubbed audio are accounted for, so that the total length of the dubbed audio matches the length of the original. For example, if the original audio lasts 12.0 seconds, speech begins at 0.8 seconds, and the dubbed speech lasts 10.5 seconds, we add 0.8 seconds of silence at the start and 12.0 - (10.5 + 0.8) = 0.7 seconds at the end.
The create_silence() method generates a silent audio clip for a specified duration. In digital audio, complete silence is represented by an array filled with zeros. To achieve this, we use the AudioClip class from MoviePy. Its content is defined as an array of zeros, resulting in an audio clip with no sound.
The add_silence_to_audio() method adds the calculated silence to both the beginning and end of the dubbed audio and saves the result. The process is pretty straightforward. First, we load the dubbed audio clip using AudioFileClip. Next, we call the create_silence() method twice to generate the starting and ending silence. Finally, we use the concatenate_audioclips() function to concatenate the two silent clips with the dubbed audio clip.
Lastly, the synchronize_audio() method manages the entire process, from calculating the silence requirements to adding silence to the dubbed audio. Essentially, it serves as the main method used to create a synchronized audio file. It operates by delegating tasks to the lower-level functions defined earlier, such as calculate_silence_durations() and add_silence_to_audio().
It produces the end result: a WAV file containing the synchronized version of the dubbed audio, ready to be overlaid on the original video to complete the automatic dubbing pipeline.
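Before moving on to the dubbing step, here is a short example of how the class might be used. The file names are placeholders, and the single timestamp is a made-up value standing in for the output of the transcription step covered earlier in this series:
# A single transcribed sentence spoken between 0.8 and 11.3 seconds (made-up values)
transcribed_timestamps = [{"timestamp": (0.8, 11.3)}]

synchronizer = NaiveAudioSynchronizer(
    original_audio="original_audio.wav",
    dubbed_audio="dubbed_audio.wav",
    transcribed_timestamps=transcribed_timestamps)

# Produce the audio file we will overlay on the video in the next section
synchronizer.synchronize_audio(output_path="synchronized_audio.wav")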
How Dubbing Works
Finally, we can overlay the new audio on the original video. This process is straightforward, as the audio was already synchronized in the previous step. MoviePy will be used once again for this task. To be more precise, we need to import one additional component from the moviepy.editor module:
from moviepy.editor import VideoFileClip
The VideoFileClip class enables the creation of video clips from existing video files. It offers methods for manipulating video content, such as extracting audio tracks, trimming, resizing, and performing other editing operations.
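As a quick, hypothetical illustration of what VideoFileClip offers (the file name is a placeholder):
from moviepy.editor import VideoFileClip

clip = VideoFileClip("original_video.mp4")
print(clip.duration)  # Length of the video in seconds

first_five = clip.subclip(0, 5)  # Trim to the first five seconds
soundtrack = clip.audio  # The video's audio track as an audio clip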
Now, let's construct a class that will allow us to overlay the new audio over the old video.
class DubbedVideoCreator:
def __init__(self, video_path, audio_path):
"""
        Initialize the DubbedVideoCreator with paths to the original video and the dubbed audio.
Args:
video_path (str): Path to the original video file.
audio_path (str): Path to the synchronized dubbed audio file.
"""
self.video_path = video_path
self.audio_path = audio_path
self.video_clip = None
self.audio_clip = None
def load_video(self):
"""
Load the video file from the provided video path.
"""
self.video_clip = VideoFileClip(self.video_path)
print(f"Loaded video: {self.video_path}")
def load_audio(self):
"""
Load the audio file from the provided audio path.
"""
self.audio_clip = AudioFileClip(self.audio_path)
print(f"Loaded audio: {self.audio_path}")
def set_audio_to_video(self):
"""
Set the dubbed audio to the video.
Returns:
VideoClip: The video with the new audio track applied.
"""
if self.video_clip is None or self.audio_clip is None:
raise ValueError("Video or audio not loaded. Make sure to load both video and audio.")
video_with_audio = self.video_clip.set_audio(self.audio_clip)
print("Audio set to the video.")
return video_with_audio
def export_video(self, output_path, codec="libx264", audio_codec="aac"):
"""
Export the video with the dubbed audio to a file.
Args:
output_path (str): Path to save the final dubbed video.
codec (str, optional): The codec to use for video encoding. Default is 'libx264'.
audio_codec (str, optional): The codec to use for audio encoding. Default is 'aac'.
"""
        # Reuse set_audio_to_video so the loading checks run before exporting
        video_with_audio = self.set_audio_to_video()
video_with_audio.write_videofile(output_path, codec=codec, audio_codec=audio_codec)
print(f"Dubbed video saved at: {output_path}")
def synchronize_and_export(self, output_path):
"""
Load the video, set the dubbed audio, and export the final video.
Args:
output_path (str): Path to save the final video with dubbed audio.
"""
self.load_video()
self.load_audio()
self.export_video(output_path)
This class enables us to easily replace the original audio of a video file with new audio and export the updated file. In our case, we will replace the original audio with the dubbed audio. Let's break down the process.
The __init__ constructor is used to initialize the class with the paths to the video and audio files. These paths allow us to locate the source files. During initialization, we provide the paths to both the video we want to modify and the audio we want to lay over it. Additionally, we create two placeholders: one to store the video object once it is loaded, and another to store the audio object once it is loaded.
The load_video method loads the video file specified by video_path into the video_clip attribute. It uses the VideoFileClip class to achieve this. Once the video is loaded, the method prints a message to confirm the successful loading of the video.
The load_audio method performs the exact same function as the load_video method, except for the audio. It loads the audio file specified by audio_path into the audio_clip attribute. It uses the AudioFileClip class to achieve this. Again, once the audio is loaded, the method prints a message to confirm its successful loading.
The set_audio_to_video method synchronizes the loaded audio with the loaded video. It first checks that both the video and audio files are loaded. Afterward, it replaces the video's original audio track with the loaded audio. Finally, it returns the video with the updated audio track.
The export_video method exports the video with the dubbed audio track to a specified file. It combines the video and audio, and then writes the final output to a file using the write_videofile function. This method supports customizations, such as specifying codecs for video (codec) and audio (audio_codec). However, most of the time you should stick with the default values. After the export, the method will also print a message to confirm the export location.
The synchronize_and_export() method is the final one. It wraps the loading of the video and audio, the attachment of the dubbed audio, and the export of the result into a single step, which simplifies the usage of the class significantly. Let's now demonstrate how to use it:
# Create the dubbed video
video_dubber = DubbedVideoCreator(
video_path="original_video.mp4",
audio_path="synchronized_audio.wav")
# Perform dubbing
video_dubber.synchronize_and_export(output_path="dubbed_video.mp4")
As shown above, we first create an instance of the class, providing the paths to the original video and the synchronized audio. Then, we simply run the synchronize_and_export method, specifying the location and filename for the dubbed video. This method will run the entire pipeline in the background:
- Load the video using the load_video method.
- Load the audio using the load_audio method.
- Overlay the audio over the video using the set_audio_to_video method.
- Export the final result using the export_video method.
And with that, we finally have our finished product, a dubbed version of the original video.
In the final article of this series on building an automatic dubbing system, we explored how to synchronize the audio generated by the TTS system with the original video. We also demonstrated how to create the final product by overlaying the synchronized audio onto the video. By now, you should be equipped to construct a complete automatic dubbing pipeline. This can be achieved by integrating the code from the various articles in this series.