Part 2: How does Speech Recognition work?
Speech recognition software is a major part of our lives today. This technology has affected how we conduct our day-to-day tasks.
Some statistics on how integrated this technology is in our lives today:
- In 2019, 74% of digital service users used their voice-based assistants to research and buy products and services, create shopping lists, and check order status (Capgemini study).
- 32% of the global online population in 2020 was using voice search on mobile phones.
- 72% of people in the US use voice search through a digital assistant.
- Nearly 50% of all web search queries were voice-activated searches in 2020.
- 2 in 5 people say voice-activated devices are essential to their lives.
- 55% of teens and 44% of adults use voice search daily.
- In 2019, around 3.25 billion digital assistants were in use on various devices globally.
- Nearly 2,600 voice apps exist for consumers to download and use.
Digital assistants like Alexa use speech recognition technology to understand what you are saying and to determine what you need. It is this process that enables the device to recognize spoken words, convert them into a machine-understandable format, and generate a response in the desired voice-based or text-based format.
For example, Alexa and Siri respond in voice or text format, whereas transcription programs like Google Dictate convert spoken words into text.
Speech Recognition and AI
Traditional speech recognition frameworks had poor accuracy because they relied on algorithms with limited capabilities that could identify only a limited number of words. Natural language has many facets, such as accent, semantics, context, and words borrowed from foreign languages, and these algorithms could not adapt to changes in language over time. This made speech recognition systems unreliable.
However, with the use of AI and machine learning models, speech recognition algorithms have improved considerably. ML models help overcome the challenges faced by traditional algorithms:
- They can process much larger datasets
- They achieve higher accuracy than traditional models
- They are self-learning, i.e., they can keep improving by adapting to changes in language
Today, speech recognition technology reaches around 95% accuracy, bringing it very close to normal human communication.
Traditional Speech Recognition
Traditional speech recognition models used classification algorithms.
The speech input is transformed into a set of features using the following models:
- Language models: Produce a sequence of words
- Pronunciation models: Tell how a particular word is spoken; each word is written out as a sequence of tokens (phonemes: basic units of sound)
- Acoustic models: Describe how the input tokens sound; they characterize the audio waveform of the input
The language models are pre-trained and used to find the most likely sequence of phonemes that can be mapped to output words, using tables containing the probabilities of token sequences. The speech processing is pre-defined.
When the speech recognition system receives an audio waveform, it computes the features X for it and searches for the output Y with the highest probability of having produced the features X.
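This search can be sketched as a toy Bayesian decoder. The candidate transcriptions and all probability values below are invented for illustration; a real system searches over vast spaces of phoneme and word sequences:

```python
# Toy illustration of the search in a traditional ASR system:
# pick the transcription Y that maximizes P(X | Y) * P(Y).
# All probabilities here are made up for the example.

# Language model: prior probability P(Y) of each candidate word sequence
language_model = {"recognize speech": 0.6, "wreck a nice beach": 0.4}

# Acoustic model: likelihood P(X | Y) of the observed features X given Y
acoustic_likelihood = {"recognize speech": 0.3, "wreck a nice beach": 0.2}

def decode(language_model, acoustic_likelihood):
    # By Bayes' rule, P(Y | X) is proportional to P(X | Y) * P(Y)
    return max(language_model,
               key=lambda y: acoustic_likelihood[y] * language_model[y])

print(decode(language_model, acoustic_likelihood))  # -> recognize speech
```

Note that the winner is decided jointly: even a transcription that fits the audio slightly worse can win if the language model considers it far more likely as a word sequence.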
Speech Recognition using Deep Neural Networks
The use of deep learning and neural networks helped improve the accuracy of speech recognition. Simply put, imagine the deep learning model as a black box system where you feed input of sound recordings or audio files into a neural network and train it to produce text. The model continuously trains itself with incoming data.
A regular (non-recurrent) neural network is a generic machine learning algorithm that takes in input and calculates a result (based on previous training). Such a network is stateless, i.e., it will always generate the same result for the same input; it has no memory of previous calculations.
Recurrent Neural Networks, or RNNs, are a version of neural networks that is not stateless. An RNN splits the input into time steps, and the previous state of the network is one of the inputs for the next step. Since the calculation from each time step is fed into the next, the model gains the ability to remember inputs. This allows neural networks to learn patterns in a sequence of data.
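The statefulness described above can be demonstrated with a minimal single-unit RNN cell. The weights here are arbitrary (a trained RNN learns them, and uses vectors rather than scalars); the point is only that the hidden state carries information forward:

```python
import math

# Minimal one-unit RNN cell; weights chosen arbitrarily for illustration.
W_X, W_H = 0.5, 0.8

def rnn_step(x, h_prev):
    # New hidden state depends on the input AND the previous state
    return math.tanh(W_X * x + W_H * h_prev)

h = 0.0
outputs = []
for x in [1.0, 1.0, 1.0]:  # the exact same input at every time step
    h = rnn_step(x, h)
    outputs.append(round(h, 4))

# Unlike a stateless network, the output changes even though the input
# does not, because the hidden state remembers previous steps.
print(outputs)
```

A stateless feedforward network fed `1.0` three times would produce three identical results; here each step's output differs because it folds in the history.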
This is useful for natural language processing and for tasks with long-term dependencies across sequences, such as speech recognition. Some RNNs implement forget and retain gates, which give the model the ability to remember information in a weighted way. This architecture is called Long Short-Term Memory, or LSTM.
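As a rough sketch of how those gates interact, here is a single scalar LSTM step. Real LSTMs operate on vectors with learned weight matrices; every weight below is invented for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Scalar LSTM cell with made-up weights. Each gate outputs a value in
# (0, 1) that weights how much information flows through.
def lstm_step(x, h_prev, c_prev):
    f = sigmoid(0.9 * x + 0.1 * h_prev)    # forget gate: keep old memory?
    i = sigmoid(0.8 * x + 0.2 * h_prev)    # input gate: write new memory?
    g = math.tanh(0.7 * x + 0.3 * h_prev)  # candidate memory content
    o = sigmoid(0.6 * x + 0.4 * h_prev)    # output gate: expose memory?
    c = f * c_prev + i * g                 # weighted update of cell state
    h = o * math.tanh(c)                   # hidden state passed onward
    return h, c

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c)
```

Because `f` and `i` are learned, weighted values rather than hard switches, the cell can blend old and new information, which is what lets LSTMs track dependencies over long sequences.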
The purpose of the RNN is to convert an input sequence into a sequence of character probabilities for the transcription.
The input sequence is fed into the RNN. The RNN generates an encoding that represents the input as a series of values, using Long Short-Term Memory.
The traditional language, pronunciation, and acoustic models in a conventional ASR (Automatic Speech Recognition) system can all be replaced by neural network models. Even the speech preprocessing step can be replaced with a CNN (convolutional neural network).
But each of these models is trained independently, so they may not work well together. To avoid the errors that come from combining separately trained models, an end-to-end model was devised in which the entire system can be trained as one.
The end-to-end (E2E) models combine acoustic, pronunciation and language models (AM, PM, LM) of a conventional ASR system into a single neural network, which helps to make the model less complex.
The end-to-end model uses one RNN as the encoder, which generates the encoding for the input sequence, and a second RNN as the decoder, which takes the encoded input and decodes it into text.
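The encoder-decoder flow can be sketched with the same kind of toy scalar RNN cell. Everything here (weights, the stand-in feature values, the fixed number of decoder steps) is invented for illustration; real systems use learned vector-valued networks and a stopping mechanism:

```python
import math

# Toy encoder-decoder sketch; weights are arbitrary, for illustration only.
def rnn_step(x, h, w_x=0.5, w_h=0.8):
    return math.tanh(w_x * x + w_h * h)

def encode(sequence):
    h = 0.0
    for x in sequence:           # consume the entire input sequence
        h = rnn_step(x, h)
    return h                     # final state summarizes the input

def decode(encoding, steps):
    h, outputs = encoding, []
    for _ in range(steps):       # unroll the decoder from the encoding
        h = rnn_step(0.0, h)     # each step feeds the previous state forward
        outputs.append(h)
    return outputs

audio_features = [0.2, -0.1, 0.7, 0.4]   # stand-in for acoustic features
encoding = encode(audio_features)
scores = decode(encoding, steps=3)       # one value per output step
```

The key design point survives even in this sketch: the decoder never sees the raw audio, only the encoder's compressed representation of it.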
End-to-end models that are used for speech recognition:
- Connectionist Temporal Classification (CTC)
- Neural Transducer (online sequence-to-sequence), which uses a CNN (convolutional neural network)
The input X is an audio sequence (a sequence of frames from x1 to xT) that is encoded with a recurrent neural network. The decoder is also a recurrent neural network; it looks at the entire encoded input and predicts the output text Y (a sequence from y1 to yL).
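In a CTC-style model, the network emits one label per audio frame, including a special blank symbol, and the final text is recovered by collapsing repeats and removing blanks. A minimal greedy-decoding sketch (the per-frame labels below are invented):

```python
BLANK = "-"  # CTC's special "no character" symbol

def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame best-path labeling into output text:
    merge consecutive repeats, then remove blank symbols."""
    collapsed, prev = [], None
    for label in frame_labels:
        if label != prev:        # merge consecutive repeats
            collapsed.append(label)
        prev = label
    return "".join(l for l in collapsed if l != BLANK)  # drop blanks

# One (made-up) best label per audio frame. The blank between the two
# runs of "l" is what preserves the genuine double letter.
frames = ["h", "h", "-", "e", "l", "l", "-", "l", "o", "o"]
print(ctc_greedy_collapse(frames))  # -> hello
```

This is why CTC needs the blank symbol: without it, "ll" in "hello" would be indistinguishable from a single "l" held across two frames.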
Divya Sikka is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/.