Anatomy of an AI Voice Assistant

This article is a detailed overview of AI voice assistants.
By Boris Delovski • Updated on Aug 5, 2024

AI voice assistants have become commonplace in today's world; chances are you carry one in your pocket every day. Assistants such as Siri and Alexa are used so frequently that many people take them for granted. Yet while they are everywhere, most people do not truly grasp how these AI voice assistants operate, nor do they realize how simple it has become to create a custom one.

This shift is largely driven by the widespread availability of Large Language Models (LLMs). Not long ago, AI voice assistants were little more than glorified chatbots, and more often than not, not even effective ones. Things changed once modern LLMs became able to maintain coherent, extended conversations and process large amounts of information at high speed. As a result, today's voice assistants have advanced significantly and now resemble sophisticated systems previously seen only in science fiction.

By exploring two crucial technologies, speech-to-text and text-to-speech, we can understand how these AI voice assistants function and even learn how to create one ourselves using Python.

How Does an AI Voice Assistant Work

Modern AI voice assistants are composed of three separate models:

  • Speech-to-Text Model
  • Large Language Model
  • Text-to-Speech Model

Each model plays a distinct role in the functionality of an AI voice assistant. The Speech-to-Text model processes spoken words and converts them into text data the Large Language Model can understand. The Large Language Model then processes that text to determine the user's intent and generate an appropriate response. Finally, the Text-to-Speech model converts the generated text response back into speech, producing the final audio output the user hears.

Some voice assistants are beginning to adopt integrated systems. One example is the new GPT-4o model, which aims to handle speech-to-text, processing, and text-to-speech within a single model. However, this approach is still in its early stages, and the majority of today's AI voice assistants operate as a pipeline combining three separate models.

This modular approach is especially beneficial for anyone who wants to build a custom AI voice assistant, because it allows combining technologies from different sources rather than being restricted to a single one. Essentially, we can pick and choose which version of each of the three models we want to use and then integrate them into a pipeline, as sketched below.
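As a rough illustration, here is a minimal sketch of such a pipeline in Python. It assumes the open-source whisper package for speech-to-text, the openai client for the Large Language Model (with an API key configured in the environment), and pyttsx3 for text-to-speech; each component could be swapped for an alternative, and "question.wav" is a hypothetical recording.

```python
# A minimal sketch of the three-model pipeline, assuming the
# `openai-whisper`, `openai`, and `pyttsx3` packages are installed.
import whisper
import pyttsx3
from openai import OpenAI

stt_model = whisper.load_model("base")   # Speech-to-Text
llm_client = OpenAI()                    # Large Language Model (API key from env)
tts_engine = pyttsx3.init()              # Text-to-Speech

def assistant_turn(audio_path: str) -> None:
    # 1. Speech-to-Text: transcribe the user's recording.
    user_text = stt_model.transcribe(audio_path)["text"]

    # 2. Large Language Model: generate a response to the transcript.
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_text}],
    )
    reply_text = response.choices[0].message.content

    # 3. Text-to-Speech: speak the response aloud.
    tts_engine.say(reply_text)
    tts_engine.runAndWait()

assistant_turn("question.wav")  # hypothetical recording of a user's question
```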


What Are Speech-to-Text Models

A Speech-to-Text model is the first component of every AI voice assistant pipeline. Although we refer to it as a single model, it is really a pipeline in its own right, consisting of multiple steps that produce the input needed by the Large Language Model. LLMs cannot process the human voice directly; they only work with text (more precisely, with numerical representations of text). Therefore, the first task is to convert spoken words into text, which requires a model capable of "understanding" speech well enough to transcribe it.

If you are a regular user of Microsoft Word, you might already be familiar with a Speech-to-Text model through the Dictate tool, which converts spoken words into accurate text. This is one application of a Speech-to-Text model. These models are widely used whenever the human voice needs to be converted into text, including:

  • transcribing meetings and interviews
  • optimizing call center performance
  • automatic generation of video captions and subtitles
  • controlling smart devices with your voice
  • real-time language translation

Different Speech-to-Text models use different technologies, but they all follow the same basic principle. 

To begin with, audio input is captured using devices like microphones, smartphones, or other recording equipment. Once captured, the audio is usually preprocessed before being fed into the model for text conversion.

Preprocessing consists of removing noise from the recording to ensure the highest possible audio quality; a minimal sketch of this step appears below. The cleaned audio is then analyzed to extract features representing the speech signal. Feature extraction is essential because it converts raw audio data into a format suitable for machine learning models.
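Here is the noise-reduction sketch referenced above. It assumes the noisereduce and librosa packages are installed, and "recording.wav" stands in for a real input file.

```python
# A minimal noise-reduction sketch, assuming the `noisereduce`
# and `librosa` packages; "recording.wav" is a hypothetical file.
import librosa
import noisereduce as nr

# Load the raw recording at 16 kHz, a common sample rate for speech models.
audio, sample_rate = librosa.load("recording.wav", sr=16000)

# Estimate the noise profile from the signal itself and suppress it.
clean_audio = nr.reduce_noise(y=audio, sr=sample_rate)
```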

A common feature extraction technique is the calculation of Mel-Frequency Cepstral Coefficients (MFCCs), which represent the short-term power spectrum of a sound and effectively capture the important characteristics of speech. MFCCs provide a compact representation of an audio signal without losing essential information.
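As an illustration, the widely used librosa library can compute MFCCs in a few lines. This is a standalone sketch; the cleaned signal from the previous example could be used in place of reloading the file.

```python
# Computing MFCC features with `librosa`; "recording.wav" is a
# hypothetical input file.
import librosa

audio, sample_rate = librosa.load("recording.wav", sr=16000)

# 13 coefficients per frame is a common choice for speech work.
mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)

print(mfccs.shape)  # (13, number_of_frames)
```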

The extracted features are then fed into an acoustic model, usually an artificial neural network. Various types of neural networks can process these features; no single architecture is universally used. The acoustic model's output is then passed to a language model, which predicts the most likely sequence of words from the processed features, typically also leveraging artificial neural networks for this task.

Finally, the text output undergoes post-processing to improve its quality. This includes grammatical corrections, punctuation insertion, formatting adjustments, as well as other refinements. The result is the final output of the speech-to-text pipeline, which can then be passed to a Large Language Model for processing.
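In practice, you rarely implement each of these stages yourself. Open-source models such as OpenAI's Whisper wrap the entire speech-to-text pipeline, from feature extraction to punctuated text, behind a single call. Here is a minimal sketch assuming the openai-whisper package, with "meeting.wav" as a hypothetical recording.

```python
# Transcribing audio with the open-source `whisper` package.
import whisper

# Smaller checkpoints ("tiny", "base") trade accuracy for speed.
model = whisper.load_model("base")

# transcribe() handles audio loading, feature extraction, decoding,
# and punctuation internally, returning the final text.
result = model.transcribe("meeting.wav")
print(result["text"])
```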

What Are Large Language Models

Large Language Models are highly advanced artificial neural networks designed to understand, generate, and manipulate human language. Specifically, they are sophisticated adaptations of the Transformer neural network architecture. While the original Transformer models were already large in terms of trainable parameters, modern Large Language Models have vastly surpassed them, with the largest reported to exceed a trillion parameters. This immense scale is what puts the "Large" in their name.

These models have numerous applications across various fields, including text generation, translation, text summarization, and more. Their true strength lies in their versatility and ease of integration into any system that interacts with textual data. A Large Language Model has the potential to greatly enhance performance and efficiency for tasks such as generating coherent text, translating languages, or summarizing documents.

In AI voice assistants, the Large Language Model serves as the "brain" of the system. After the speech-to-text system transcribes the user's spoken words, the Large Language Model is the component that "understands" the input and generates a coherent response. Using a high-quality Large Language Model is arguably more important than having the best speech-to-text or text-to-speech systems. 

High-quality Large Language Models can understand even poorly transcribed sentences, compensating for any shortcomings in the speech-to-text component. Similarly, the substance of the response is more important than the exact voice delivering it. Therefore, even if the text-to-speech system has imperfections, its ability to communicate the correct response is what matters most.

There are many Large Language Models to choose from. The main decision comes down to whether we want to use a proprietary Large Language Model or an open-source option. 

Proprietary models are developed and maintained by private organizations. Their source code, training data, and model weights are not publicly accessible, and access is usually provided through paid APIs or software licenses. These models are generally of the highest quality. However, depending on how heavily your AI voice assistant uses them, they can become quite expensive. Some popular proprietary Large Language Models are:

  • GPT-4 and GPT-4o
  • Gemini
  • Cohere

The aforementioned models, especially GPT-4, represent the top echelon of Large Language Models. However, this should not discourage those who cannot afford to spend large amounts of money on such models. Several open-source options are very good and can nearly match the performance of proprietary models. Some of these include:

  • Llama 2
  • Falcon
  • GPT-NeoX

You can use any of the aforementioned models as the "brain" of your AI voice assistant. Doing so would effectively meet the needs of anyone looking to build one.
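As a quick illustration, an open-source model can be loaded from the Hugging Face Hub with the transformers library. This is a minimal sketch: the Falcon checkpoint named below is one freely available option, and running a 7-billion-parameter model locally requires a capable GPU.

```python
# Generating a reply with an open-source LLM via Hugging Face
# `transformers`; the checkpoint is one example and can be swapped.
from transformers import pipeline

# Downloads the model weights on first use.
generator = pipeline("text-generation", model="tiiuae/falcon-7b-instruct")

prompt = "User: What is the tallest mountain on Earth?\nAssistant:"
outputs = generator(prompt, max_new_tokens=100, do_sample=True)

print(outputs[0]["generated_text"])
```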

What Are Text-to-Speech Models

To put it simply, text-to-speech models are systems that convert written text into spoken words. They were initially developed as assistive technology to help those with reading difficulties. Today these tools are available online and built into many devices.

In the context of AI voice assistants, text-to-speech is the least important component. Indeed, it would be ideal for the generated voice to sound as natural as possible, capturing the nuances of human speech. However, the primary concern of text-to-speech models is to effectively convey the message created by the Large Language Model. Provided that the message is communicated clearly, minor imperfections in voice quality are usually acceptable, making this aspect of the AI voice assistant the least prioritized.

For those looking for a highly natural-sounding AI voice assistant, voice cloning is likely the simplest option. This process involves creating a synthetic replica of a person's voice, and producing a high-quality copy has become quite simple, provided you have enough recorded material. While voice cloning is relatively easy, it deserves its own detailed article, so we won't delve into it here.

When building an AI voice assistant, you can choose from a range of popular free and open-source text-to-speech models; a brief usage sketch follows the list below. The most popular ones are:

  • Tacotron 2
  • Mozilla TTS
  • MaryTTS
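For instance, Coqui TTS, a maintained fork of Mozilla TTS, ships pre-trained Tacotron 2 checkpoints. The following is a minimal sketch assuming the TTS package is installed; the model name is one of its published English checkpoints.

```python
# Synthesizing speech with Coqui TTS (a maintained fork of Mozilla TTS)
# using a pre-trained Tacotron 2 checkpoint; assumes the `TTS` package.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Render the assistant's reply to a WAV file.
tts.tts_to_file(
    text="Hello! How can I help you today?",
    file_path="reply.wav",
)
```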

In conclusion, AI voice assistants have evolved from basic chatbots into sophisticated tools. Thanks to the development of Large Language Models, they are now capable of fluid, coherent interactions. Understanding the core components (Speech-to-Text, Large Language Models, and Text-to-Speech) reveals the intricacies of these systems and highlights the flexibility available when building custom solutions. By leveraging diverse technologies for each component, anyone with basic programming knowledge can create a functional, efficient AI voice assistant tailored to their specific needs. In future articles, we will explore in detail how to build your own AI voice assistant using Python and free, open-source models.

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.