How Does AI Generate Music?

A comprehensive overview of AI-music generation.
By Boris Delovski • Updated on Sep 19, 2024

Generative AI is currently revolutionizing various forms of artistic expression. It has gained significant attention, particularly in the field of image creation. However, its applications span far beyond this domain. One exciting development is the emergence of AI models capable of generating music.

Initially, these models produced outputs that were far from meeting human standards. The early compositions were often rudimentary and lacked the complexity and emotional depth characteristic of music made by humans. Fortunately, the technology has rapidly evolved. Today, AI-generated music has reached a level of sophistication that not only makes it pleasant to the ear but also commercially viable.

As a result of this progress, there are now companies dedicated to creating and distributing AI-produced music. These companies leverage advanced algorithms and vast datasets to craft melodies and harmonies. They even create full compositions that can be used in a variety of contexts, ranging from background scores in films and video games to personalized playlists for individual users.

Let us dive deep into the field of music generation with AI.

What Makes Generating Music with AI Challenging?

Creating music with AI presents a significant challenge because of the inherently temporal nature of music. Unlike image generation, which deals with static moments in time, music unfolds over time and requires a dynamic understanding of this progression. Therefore, we must employ Deep Learning models that can effectively capture this temporal aspect. Without doing so, there is no chance of producing music that sounds even remotely human-made. The more complex the music we want to create, the more advanced the model we need to use.

For instance, creating a simple melodic progression for a single instrument, while not trivial, is far less complicated than composing a piece that involves multiple instruments. Each instrument has its own temporal dynamics that interact and evolve over time. In such cases, it is impossible to impose a single, standard chronological ordering of notes. In polyphonic music, notes are often organized into chords, arpeggios, melodies, and more, making the task even more challenging.
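To make the ordering problem concrete, below is a tiny illustration in Python using a piano-roll matrix (one common representation, assumed here purely for illustration): because the notes of a chord start at the same instant, no single note-after-note ordering exists, so the piano roll instead records which pitches are active at each time step.

    import numpy as np

    # 128 MIDI pitches, 8 time steps; the values are placeholders for illustration.
    num_pitches, num_steps = 128, 8
    piano_roll = np.zeros((num_pitches, num_steps), dtype=np.int8)

    # A C-major chord (C4, E4, G4) held for the first four time steps...
    piano_roll[[60, 64, 67], 0:4] = 1
    # ...followed by a single melody note (A4) for the remaining steps.
    piano_roll[69, 4:8] = 1

    # The chord's three notes share the same onset, so there is no single
    # "next note" ordering; the matrix captures simultaneity directly.
    print(piano_roll[[60, 64, 67, 69]])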

This means that we essentially need to tackle three problems simultaneously when creating these models: capturing temporal structure, handling multiple interacting instruments, and representing polyphony. Additionally, we could categorize the models currently in use into two types: those designed to generate new music from scratch, and those intended to enhance the capabilities of existing music production models. However, we will not explore this distinction in detail here. Instead, let us focus on how we address each of the three main challenges to develop a model capable of generating music.

Let us take a quick look at the most popular models designed for music generation.

What Are Long Short-Term Memory Networks (LSTMs)?

Recurrent Neural Networks (RNNs), especially the advanced variant known as Long Short-Term Memory Networks (LSTMs), are widely used for processing sequential data. LSTMs, as a specialized type of RNN, are designed to capture both short-term and long-term dependencies. This ability makes them well suited to music generation: they can consider the entire sequence of preceding notes instead of only the most recent note. LSTMs achieve this through special memory cells with a gating mechanism involving three gates: an input gate, a forget gate, and an output gate. These gates help the cells decide which information to add, retain, or discard, enabling the network to effectively manage dependencies over long sequences.
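As a rough illustration, here is a minimal sketch of next-note prediction with an LSTM. It assumes PyTorch and an illustrative 128-pitch vocabulary; it shows the shape of the model rather than a full training pipeline.

    import torch
    import torch.nn as nn

    class NoteLSTM(nn.Module):
        """Predicts the next note ID from a sequence of previous note IDs."""
        def __init__(self, num_notes=128, embed_dim=64, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(num_notes, embed_dim)   # note IDs -> vectors
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, num_notes)      # logits over notes

        def forward(self, note_ids):
            x = self.embed(note_ids)          # (batch, seq_len, embed_dim)
            out, _ = self.lstm(x)             # one hidden state per time step
            return self.head(out)             # (batch, seq_len, num_notes)

    # Example: guess the note that follows a short sequence of MIDI pitches.
    model = NoteLSTM()
    sequence = torch.tensor([[60, 62, 64, 65]])     # C, D, E, F as MIDI numbers
    logits = model(sequence)
    next_note = logits[0, -1].argmax().item()       # most likely next pitch
    print(next_note)

In practice, such a model would be trained on note sequences extracted from MIDI files, learning to predict each note from the ones that precede it.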

However, LSTMs and other RNNs are inherently limited by their sequential nature. They process data one step at a time, which slows down training due to their lack of parallel processing capabilities. Consequently, quick and efficient training of an LSTM on a large audio dataset is challenging. As a result, LSTMs have become less popular since the early stages of music generation research. Nowadays, we rarely use pure LSTMs for music generation, instead incorporating LSTM layers into larger, more complex, and more powerful models.


What Are Convolutional Neural Networks (CNNs)?

Convolutional Neural Networks (CNNs) are a type of deep learning model originally developed for computer vision, where they achieved notable success in image recognition and related applications. Researchers have since adapted CNNs to handle sequential data. By modifying the standard CNN architecture, it is now possible to generate sequences of notes, making CNNs suitable for music generation.

The main advantages of CNNs over previously mentioned models are:

  • local pattern detection
  • parallel processing
  • hierarchical feature learning

CNNs are excellent at identifying local patterns in data. This allows them to recognize specific features in images, such as edges and shapes that correspond to different objects. This ability is also useful in music, enabling CNNs to identify and learn patterns such as motifs, rhythms, and harmonies from the musical segments they are trained on. 

Additionally, thanks to their parallel processing capabilities, CNNs can be trained much faster on large audio datasets than sequential models like LSTMs. This more efficient training is crucial for practical applications where speed is a major factor.

Finally, because CNNs consist of multiple layers, they can learn hierarchical representations of music. Earlier layers usually capture more basic features, such as individual notes, while later layers capture higher-level features, such as chords and melodies.
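To make this hierarchy concrete, here is a minimal sketch of stacked dilated causal 1D convolutions, the building block behind audio CNNs such as WaveNet (discussed next). PyTorch is assumed, and the channel counts and dilation rates are purely illustrative.

    import torch
    import torch.nn as nn

    class CausalConv1d(nn.Module):
        """A 1D convolution that only looks at past samples (left-padded)."""
        def __init__(self, channels, dilation):
            super().__init__()
            self.pad = dilation          # (kernel_size - 1) * dilation, kernel size 2
            self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

        def forward(self, x):
            x = nn.functional.pad(x, (self.pad, 0))   # pad on the left only
            return torch.relu(self.conv(x))

    # Doubling the dilation at each layer grows the receptive field exponentially,
    # letting a relatively shallow stack capture longer-range structure in audio.
    layers = nn.Sequential(*[CausalConv1d(channels=32, dilation=2 ** i) for i in range(6)])

    waveform = torch.randn(1, 32, 16000)   # (batch, channels, samples), dummy audio
    features = layers(waveform)
    print(features.shape)                   # same length as the input, still causal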

One of the best examples of a CNN for music generation is WaveNet. Originally designed for generating human speech, WaveNet's high fidelity and flexible architecture also made it a popular choice for music generation. While it excelled at producing short, high-quality audio segments, it struggled with generating long audio sequences. The impressive short segments it creates lack the complexity and structure of a full composition. To achieve high-quality long compositions, further advancements in generative models were necessary, and these advancements did come in the form of new and innovative models.

What Are Generative Adversarial Networks (GANs)?

GANs are special neural models. They consist of two neural networks: a generator and a discriminator. The generator tries to create fake data that mimics real data, while the discriminator tries to distinguish between real and fake data. These two networks are trained together, each improving with every iteration. The ultimate goal is to train a generator so effectively that even the trained discriminator cannot tell the difference between fake and real data. This style of learning provides significant advantages for music generation. 

Firstly, GANs inherently do not require labeled data. This eliminates the need for annotations and simplifies data collection and dataset preparation. Secondly, the previous models discussed learned to create music similar to the training dataset. They did not truly understand the underlying concepts of music creation but merely replicated patterns based on existing rules. 

In contrast, GANs are designed to generate something new. Because they do not need annotated data, their training datasets can include a wide variety of musical styles. Finally, the model is considered trained only when the discriminator can no longer distinguish the generated music from real music. This pushes the generator toward high-quality output, so the music sounds natural and similar to something composed by humans.

GANs come with their own set of flaws. Several challenges arise when training a GAN to generate music. Firstly, GANs are notoriously difficult to train and often encounter an issue known as mode collapse. This occurs when the GAN stops producing novel examples and instead generates the same patterns repeatedly. Consequently, the music produced sounds monotonous and repetitive. Additionally, evaluating the quality of generated music is challenging because of the subjective nature of music itself. While there are metrics that can approximate the quality of music samples, human evaluation is often necessary as the final test of music quality.
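To make the adversarial setup concrete, here is a minimal sketch of the generator-versus-discriminator training loop. PyTorch is assumed, and the tiny networks and random stand-in data are placeholders rather than a real music GAN.

    import torch
    import torch.nn as nn

    latent_dim, sample_len, batch_size = 64, 256, 32

    # Toy generator and discriminator; real audio GANs use far larger networks.
    generator = nn.Sequential(
        nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, sample_len), nn.Tanh()
    )
    discriminator = nn.Sequential(
        nn.Linear(sample_len, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1)
    )

    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    loss_fn = nn.BCEWithLogitsLoss()

    for step in range(1000):
        real = torch.randn(batch_size, sample_len)     # stand-in for real audio clips
        noise = torch.randn(batch_size, latent_dim)
        fake = generator(noise)

        # Discriminator step: label real clips 1 and generated clips 0.
        d_loss = loss_fn(discriminator(real), torch.ones(batch_size, 1)) + \
                 loss_fn(discriminator(fake.detach()), torch.zeros(batch_size, 1))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator step: try to make the discriminator label fakes as real.
        g_loss = loss_fn(discriminator(fake), torch.ones(batch_size, 1))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()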

Over time, several GANs have gained recognition and popularity for their ability to create high-quality music:

  • MuseGAN
  • GANSynth
  • WaveGAN

MuseGAN is one of the most renowned GANs for music generation. It gained fame for its ability to generate multi-track music. This means MuseGAN can produce audio samples where multiple instruments work together to create cohesive musical compositions.

GANSynth represents a significant advancement in audio generation by synthesizing entire audio sequences in parallel, which makes it considerably faster than other models. It can also effectively disentangle specific features of sound, such as pitch and timbre. This ability allows GANSynth to create audio outputs that are more nuanced and varied.

WaveGAN is unique due to the exceptionally high-quality audio it produces. It may not be as fast as GANSynth or as specialized in multi-track music as MuseGAN. However, its strength lies in the raw fidelity of the sound it generates. This makes it ideal for scenarios where the pure quality of the produced audio is prioritized over speed and compositional complexity.

What Are Transformers?

Transformers have become the leading neural network architecture today. Originally designed for language translation, their self-attention mechanism has revolutionized the way relationships and patterns in data are captured. Over time, the Transformer architecture has been adapted to various fields, leading to innovations like Vision Transformers for computer vision tasks. Similarly, this architecture has been tailored for music generation, resulting in one of the most popular models in the field: MusicGen.
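As a small illustration of the mechanism, here is a minimal self-attention sketch over a sequence of note embeddings, assuming PyTorch; the dimensions are illustrative.

    import torch
    import torch.nn as nn

    embed_dim, num_heads, seq_len = 64, 4, 32
    attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    notes = torch.randn(1, seq_len, embed_dim)            # stand-in for embedded notes
    attended, weights = attention(notes, notes, notes)    # queries, keys, values

    print(attended.shape)   # (1, 32, 64): one context-aware vector per note
    print(weights.shape)    # (1, 32, 32): how strongly each note attends to the others

Unlike an LSTM, every position attends directly to every other position, so long-range relationships do not have to be carried step by step through a hidden state.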

MusicGen differs from the previously discussed models because it generates musical pieces from text descriptions. For example, you can provide a prompt like "80s rock song with a cool guitar riff," and it will create music that fits that description. Essentially, MusicGen is a controllable music generation model that offers significant control over the type of music produced. The intricate workings of this model will not be explored in this article. Instead, a detailed explanation and implementation in Python will be covered in future articles.
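For readers who want a quick taste before those articles, a rough usage sketch is shown below. It assumes the Hugging Face transformers library (a recent version that includes MusicGen) and the facebook/musicgen-small checkpoint; the generation settings are illustrative.

    import scipy.io.wavfile
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

    # Turn a text prompt into a short audio clip.
    inputs = processor(
        text=["80s rock song with a cool guitar riff"],
        padding=True,
        return_tensors="pt",
    )
    audio = model.generate(**inputs, do_sample=True, max_new_tokens=256)

    # Save the generated waveform to disk.
    sampling_rate = model.config.audio_encoder.sampling_rate
    scipy.io.wavfile.write("musicgen_sample.wav", rate=sampling_rate, data=audio[0, 0].numpy())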

What Are the Ethics of Generating Music Using AI?

The ethics of generating music with AI present a nuanced and multifaceted dilemma, particularly when it comes to distinguishing between genuine innovation and potential plagiarism. AI models like Generative Adversarial Networks (GANs) and Transformers are trained on vast datasets of existing music, encompassing a wide range of genres, styles, and compositions. This extensive training data is crucial for the AI to learn the intricate patterns and structures that define music. However, this very process of learning from pre-existing works raises significant ethical questions about the originality of the music produced by these models.

One of the main ethical concerns surrounding AI-generated music is the potential infringement of intellectual property rights. When an AI model creates a new musical piece, it does so based on the patterns and elements learned from its training data. This reliance on existing works can lead to situations where the generated music closely resembles existing compositions, raising concerns about plagiarism. 

The challenge lies in the fact that music, by its nature, is highly iterative and often builds upon previous works. This is why it is complex to determine whether an AI is producing something truly original or merely replicating elements from its training data. This issue is further complicated by the fact that even human composers frequently produce music that bears similarities to earlier works. This makes it especially difficult to establish a clear metric for evaluating plagiarism in AI-generated music.

The ethics of AI-generated music involve a delicate balance between innovation and intellectual property rights. Ensuring that AI-generated music is truly original and not a mere replication of existing works is a complex task that requires clear guidelines and ethical standards. As AI technology continues to evolve, it is crucial to address these issues to protect the rights of original artists and preserve the value of human creativity within the music industry.

The landscape of AI-generated music has evolved significantly over time. Models like LSTMs, CNNs, GANs, and Transformers have each introduced innovative approaches to address the many challenges of music generation. These advancements offer exciting possibilities and provide valuable tools for those without traditional music training. However, it is crucial to ensure that the music produced by these models does not infringe on the intellectual property rights of human composers. As we have explored in this article, creating music with AI involves navigating complex challenges and ethical considerations. In a future article, we will take a closer look at how to implement a music generation model using Python. 

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.