Roadmap to Building a Video Editing App in Python

A detailed roadmap for building a video editing app in Python.
By Boris Delovski • Updated on Feb 25, 2025

The process of recording a professional-level video can be frustrating, especially when you stumble over a sentence midway through the script. At that point, you have two options: stop and start over, or continue recording and edit the mistake later. If you pause to note the time of the error, it disrupts your flow. This makes it harder to stay focused and maintain the video's quality. However, if you keep recording without marking the error, you'll need to search through the video later to locate and fix it. This task becomes an even bigger challenge when multiple errors occur. Often, the more practical approach is to push through the recording and dedicate extra time afterward to identify and edit out the mistakes. 

Fortunately, recording and editing videos no longer have to be this difficult. Recent advancements in video editing tools now allow you to edit videos as easily as editing a Word document. These tools generate a transcript of your video, enabling you to remove specific words or sections of the transcript to cut out unwanted parts of the video. While such software typically requires a subscription, this series of articles will demonstrate how to create a Python-based app. This app will offer similar functionality without the recurring cost.

How to Generate a Transcript with Timestamps

Creating a preprocessing pipeline is the first step in building our app. This pipeline should be able to perform the following tasks on a given video:

  • Generate a transcript of the video 
  • Provide timestamps for each word in the transcript

If you already have a transcript, such as when reading directly from a script in the video, the process may be simpler and slightly different. However, for the sake of this example, I’ll assume you don’t have a transcript. This scenario is more challenging and offers a more comprehensive demonstration.

We will use Whisper Timestamped, a modified version of OpenAI's Whisper model. This enhanced version builds upon the original Whisper model by providing precise word-level timestamps and confidence scores for multilingual speech recognition. While the standard Whisper model provides approximate timestamps for speech segments, Whisper Timestamped delivers more detailed timing for each word. Additionally, it is completely open-source.

Since this model is widely accessible, there is no need to rely on the openai-whisper package. Instead, we can easily access it through Hugging Face, a popular platform for sharing deep learning models, especially those based on the Transformers architecture. This approach simplifies setup, since the model can be loaded with the same transformers tooling used for many other deep learning models, with no Whisper-specific installation.
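For illustration, here is a minimal sketch of what this could look like with the transformers library's speech recognition pipeline. The checkpoint name and the audio file name are placeholders, and the WAV file itself will be produced in the step described next:

from transformers import pipeline

# Minimal sketch: the checkpoint and file name below are placeholders
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    return_timestamps="word",  # request word-level timestamps
)

result = transcriber("extracted_audio.wav")

# Each chunk pairs a word with its (start, end) timestamps in seconds
for chunk in result["chunks"]:
    print(chunk["text"], chunk["timestamp"])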

To achieve the desired results, we first need to extract the audio from our video as a WAV file. A WAV file (Waveform Audio File Format) is a widely used digital audio file format created by Microsoft and IBM. It is commonly used to store uncompressed audio data on Windows systems but is also supported across most other platforms.

Using uncompressed audio data ensures that the Whisper Timestamped model has the best quality input. This maximizes its ability to generate an accurate transcript with detailed timestamps for each word. By providing the model with a high-quality WAV file, we avoid issues that can arise from audio compression, such as loss of detail or artifacts. These issues could negatively impact transcription accuracy.

MoviePy, a Python library designed for video editing, will be used to extract the audio. MoviePy provides a simple and efficient way to handle various video manipulation tasks, including creating, editing, and compositing video files. One of its features allows us to easily extract audio from a video and save it as a WAV file.

This process is straightforward and requires only a few lines of code. MoviePy handles the extraction, ensuring that the resulting WAV file is of high quality and ready to be processed by the Whisper Timestamped model.
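As a rough sketch, the extraction step could look like this; the file names are placeholders, and the import path assumes MoviePy 1.x:

from moviepy.editor import VideoFileClip  # MoviePy 1.x import path

# File names here are placeholders
video = VideoFileClip("input_video.mp4")

# Save the audio track as an uncompressed 16-bit PCM WAV file,
# the lossless input format discussed above
video.audio.write_audiofile("extracted_audio.wav", codec="pcm_s16le")

video.close()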


How to Create a Video Editing Pipeline

After extracting audio and generating a transcript with timestamps using the Whisper Timestamped model, the next step is to create a video editing pipeline. Each word in the transcript comes with start and end timestamps. This means that by removing a word from the transcript, we can easily identify and exclude the corresponding segment of the video.

For example, the model will return timestamps that look like this:

{"text": " Machine","timestamp": [0.94, 1.26] }

This shows that the word "Machine" starts at 0.94 seconds and ends at 1.26 seconds. To remove "Machine" from the video, we can simply exclude the segment from 0.94 to 1.26 seconds.

MoviePy will be used again to manage video editing. The process involves comparing the original transcript with the edited version to determine which words have been removed. Using the timestamps of the missing words, we’ll identify the video segments to exclude. Instead of directly "removing" those segments, our pipeline will construct the final video by concatenating the segments we want to keep.
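As a rough sketch of that comparison step, the pipeline could merge the timestamps of the kept words into contiguous spans. The function below is hypothetical and assumes each word is a dictionary in the format shown earlier:

# Hypothetical sketch: derive the (start, end) spans to keep from the
# word-level timestamps, given the indices of the words the user removed.
# Assumes each word looks like {"text": ..., "timestamp": [start, end]}.
def segments_to_keep(words, removed_indices):
    segments = []
    span_start = None
    span_end = None
    for i, word in enumerate(words):
        start, end = word["timestamp"]
        if i in removed_indices:
            # Close the span that was open before this removed word
            if span_start is not None:
                segments.append((span_start, span_end))
                span_start = None
        else:
            if span_start is None:
                span_start = start
            span_end = end
    if span_start is not None:
        segments.append((span_start, span_end))
    return segments

In the full app, removed_indices would come from diffing the original transcript against the edited one.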

Imagine a one-sentence video:

Deep Learning is a subset of Machine Learning that uses neural networks, which are models designed to simulate how the human brain processes information. 

Now imagine we want to remove a part of this sentence, resulting in the following:

Deep Learning is a subset of Machine Learning that uses models designed to simulate how the human brain processes information.

Our pipeline will create the final video by cutting the original video into three parts:

  • Part 1: Everything before "uses"
  • Part 2 (Unwanted): The segment containing "neural networks, which are"
  • Part 3: Everything after "are"

To produce the edited video, we will use MoviePy to concatenate the sections we want to keep. We will skip the segments that correspond to words or phrases we wish to remove. However, performing this cutting and concatenation process for every individual change would be highly inefficient and time-consuming. To address this, we will implement a system that allows users to preview their edits in real time. This approach provides immediate feedback without the need to create a new video file for each small adjustment.

The dynamic preview system will function by temporarily displaying only the segments of the video that match the edited transcript. Instead of exporting a new video file after each modification, the app will use the timestamps from the Whisper Timestamped model to determine which parts of the video to show and which to skip. This allows users to interact with the transcript, make edits, and immediately see the impact on the video without delays.

This system is designed to ensure a smooth and non-destructive workflow. The original video remains unchanged throughout the editing process. The app will perform the actual cutting and concatenation steps only when the user is fully satisfied with their edits, thus creating the final, exported video. By delaying the processing until the end, we optimize performance and avoid unnecessary operations.
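That final export step might look like the following minimal MoviePy sketch; the timestamps and file names are hypothetical, and the import path again assumes MoviePy 1.x:

from moviepy.editor import VideoFileClip, concatenate_videoclips

# Hypothetical spans to keep, such as those produced by segments_to_keep()
keep_segments = [(0.0, 4.2), (5.8, 9.5)]

video = VideoFileClip("input_video.mp4")  # placeholder file name

# Cut each span we want to keep and join them in order
clips = [video.subclip(start, end) for start, end in keep_segments]
final = concatenate_videoclips(clips)

final.write_videofile("edited_video.mp4")
video.close()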

Creating this dynamic preview system is not a trivial task. It requires efficiently rendering temporary segments of the video without saving them as files, while also providing an interactive interface for users to edit the transcript and view changes in real time. This functionality will be the focus of the final step in building the app. It ensures that users can make and visualize their edits seamlessly before finalizing the video.

How to Design a User Interface 

To make the app user-friendly, we need to create an interface that simplifies its functionality. Python offers various libraries for building user interfaces, such as Gradio, Streamlit, and others. These libraries simplify the creation of front-end interfaces, ensuring that even first-time users can navigate the app with ease. 

In this case, Streamlit will be used to build the interface. Nothing too elaborate is needed for our purpose, so the design will be simple. The interface will feature an upload button, allowing users to select the video file they want to process from their system. Once the video is uploaded, the app will automatically start the preprocessing step. It will extract the audio as a WAV file, process it with the Whisper Timestamped model, and generate a transcription. The transcription will be displayed in an editable text box, alongside a video player that shows the uploaded video. At this stage, the video remains unedited, allowing users to play it in its entirety from start to finish.
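To make this concrete, here is a rough sketch of how that layout could be assembled in Streamlit. The run_preprocessing helper is hypothetical, standing in for the pipeline described earlier:

import streamlit as st

st.title("Transcript-Based Video Editor")  # placeholder title

# Upload button for selecting the video to process
uploaded_video = st.file_uploader("Upload a video", type=["mp4", "mov", "avi"])

if uploaded_video is not None:
    # Hypothetical helper standing in for the preprocessing pipeline:
    # extract the audio, run the transcription model, return the text
    transcript = run_preprocessing(uploaded_video)

    # Editable transcript box alongside a player showing the uploaded video
    edited_transcript = st.text_area("Transcript", transcript, height=300)
    st.video(uploaded_video)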

After editing the text in the transcription box and saving their changes, users will see the video player update. It will display only the segments of the video that correspond to the edited transcript. This preview functionality lets users view their changes dynamically, without altering the original video or generating a new version after each adjustment.

Beneath the video player, there will be a button to export the final edited video. Once users are satisfied with the preview, they can click this button to apply the video editing pipeline. This will cut and concatenate the necessary segments to produce the edited version of the original video. 

Additionally, the interface will include an undo button, allowing users to revert their most recent edit. This feature is particularly useful for refining their work.
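One way to support this, sketched below, is to keep a small history stack in Streamlit's session state; this is an illustration rather than the final design:

import streamlit as st

# Sketch: keep previous transcript versions on a stack in session state
if "history" not in st.session_state:
    st.session_state.history = []

def save_edit(new_transcript):
    # Push the current version before replacing it with the new edit
    st.session_state.history.append(st.session_state.get("transcript", ""))
    st.session_state.transcript = new_transcript

if st.button("Undo") and st.session_state.history:
    # Restore the most recently saved version
    st.session_state.transcript = st.session_state.history.pop()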

Because the app avoids re-editing the video with every minor change, it remains efficient. The only time-consuming step happens during the initial upload, because the video must be processed by the Whisper Timestamped model to create the transcription. However, this process only happens once per upload. Afterward, all actions, except exporting the final version, will be nearly instantaneous, providing a smooth and responsive user experience.

Building a video editing app in Python presents an exciting opportunity to combine advanced AI technologies with practical video manipulation tools. By utilizing models like Whisper Timestamped for precise transcription and libraries like MoviePy for video processing, this project offers a seamless solution for editing videos. It provides a user-friendly experience, making video editing as easy as editing text.

Through careful integration of features like timestamped transcripts, dynamic previewing, and a user-friendly interface powered by Streamlit, users can efficiently make edits. This approach ensures that the original video remains unaltered until the final step. Such a workflow preserves the quality of the source material while ensuring a responsive and intuitive editing experience.

In the next article in this series, we will start building our app by first creating the pipeline to extract audio from the video. Afterward, we will generate a transcript using the Whisper Timestamped model.

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.