Making sense of sound - Digital Audio Processing Lab at IIT Bombay

Bengaluru

16 Apr 2020

“You are going.”

As a text, these may be just three words. With no associated tone, it may merely indicate that the person being spoken to is going. But attach a tone, and it could mean a command, or could express emotions such as surprise or disappointment!

Humans communicate a lot non-verbally, thanks to the ability of our brain to understand tone. Would computers be able to do this? Currently capable of understanding plain text, they are struggling to learn the emotions behind the words, conveyed through tone. But these machines are catching up fast. Digital audio processing tools equip computers to understand various information in sound, including emotions.

Prof Preeti Rao of the Indian Institute of Technology Bombay is an expert in sound processing, an approach that helps us do various useful things, one of which is removing unwanted sounds (or noise) from an audio clip. With her team at the Digital Audio Processing Lab, Prof Rao attempts to understand the nature of sound, reveal the information it may hold, and use it for, say, identifying tracks, melodies or the raga of a song.

Sound, when converted to an electrical signal, tells us much through the shape of its waveform, the amplitude of its constituent frequencies (frequency spectrum), and other features like pauses.
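To make the idea of a frequency spectrum concrete, here is a minimal sketch in Python. It is not the lab's code: the sample rate, the two tone frequencies and the function name `magnitude_spectrum` are all illustrative assumptions. It builds a toy signal from two pure tones and uses a plain discrete Fourier transform to recover the amplitude at each frequency.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Return the DFT magnitude of each frequency bin up to the Nyquist bin."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        spectrum.append(abs(total) / n)
    return spectrum


SAMPLE_RATE = 800  # samples per second (illustrative value)
N = 400            # number of samples, i.e. half a second of signal

# A toy "sound": a strong 100 Hz tone plus a weaker 250 Hz tone.
signal = [
    1.0 * math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 250 * t / SAMPLE_RATE)
    for t in range(N)
]

spectrum = magnitude_spectrum(signal)
bin_width = SAMPLE_RATE / N  # each bin covers 2 Hz here

# Pick the two strongest bins and convert their indices to frequencies.
top_bins = sorted(range(len(spectrum)), key=lambda k: spectrum[k], reverse=True)[:2]
peak_frequencies = sorted(k * bin_width for k in top_bins)
print(peak_frequencies)
```

The two strongest bins sit at the frequencies of the two tones, which is the kind of information a waveform alone does not show at a glance. Real audio tools use a fast Fourier transform rather than this slow textbook form.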

“It is quite like an ECG or electrocardiogram. A doctor can look at a person’s ECG and understand the condition of his/her heart. With the help of mathematical and computational tools, it is possible to analyse an ECG automatically too. Similarly, we can use these tools to extract useful information from the sound signal,” explains Prof Rao.

Her team works on two broad areas: speech processing and musical information research.

Speech recognition is the analysis and processing of spoken words for human-machine interfaces like robots or electronic assistants such as Siri or Alexa. A computer can identify words and phrases by digitally comparing the waveform or the frequency spectrum of the input signal with a reference. Researchers are using reference databases and artificial intelligence to extend the capabilities of the computer to identify even emotions in speech.

Prof Rao’s team works on the analysis of sound to recognise tone, pace and stress on words, and algorithms to reduce noise in audio signals. The team has developed various useful applications, such as teaching aids for learning a new language and a mobile app to evaluate children’s understanding of the text while reading.

Digital analysis of musical audio can help identify melodic patterns, tempi and rhythmic features. This information can be used to find musical pieces, search large musical databases, navigate through a long recording, and even make recommendations in online music streaming applications such as YouTube Music and Spotify. Musicologists and historians of music, who wish to study how music evolved, or musical styles of various genres, analyse music mainly through written scores. Digital processing tools equip them with the intelligence required to analyse large corpora of data.

Application of digital processing tools to analyse Hindustani classical music is even more significant. Unlike western music, Hindustani music does not have a strictly written score, and often, analysis of recorded audio is the only practical way to study it.

“Digital analysis of thousands of recorded concerts opens up a possibility of in-depth analysis,” says Prof Rao.

With her team, she works on various aspects of Hindustani classical music, such as analysis of recorded performances, the tonal quality of instruments and goodness of tabla bols.

Dealing with Degraded Speech

“Alexa, add tissues to the shopping list.”
Alexa: “Adding two shoes to the shopping list.”
This conversation would be hilarious if it were not frustrating! Anyone who has used an electronic assistant has, at least once, experienced the challenge of getting it to understand a command.

Electronic assistants get commands from a distance, usually from a few feet away, and ‘hear’ them using the built-in microphones. They identify words using automatic speech recognition software inside them. In the presence of disturbances in the surroundings, the speech recognition software cannot identify the command correctly. Various sounds like opening or shutting of doors, rotating fan, or movement of chairs can cause ‘disturbance’ or noise. Sound reflects from the roof and the walls of the room and causes multiple, delayed copies of the same sound signal. This phenomenon, called reverberation, fools the speech recognition software.
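Reverberation, as described above, can be pictured as the original sound mixed with delayed, weaker copies of itself. The following Python sketch models just that. The function name `add_reflections` and the delay and gain values are hypothetical, chosen only for illustration; a real room produces many more reflections.

```python
def add_reflections(dry, reflections):
    """Mix a signal with delayed, attenuated copies of itself.

    `reflections` is a list of (delay_in_samples, gain) pairs, one per
    reflecting surface in this toy model.
    """
    longest_delay = max((delay for delay, _ in reflections), default=0)
    wet = list(dry) + [0.0] * longest_delay
    for delay, gain in reflections:
        for i, x in enumerate(dry):
            wet[i + delay] += gain * x
    return wet


# A single click, followed by silence.
dry = [1.0, 0.0, 0.0, 0.0]

# Two hypothetical reflections: one arriving 2 samples late at half
# strength, another arriving 5 samples late at quarter strength.
wet = add_reflections(dry, [(2, 0.5), (5, 0.25)])
print(wet)
```

The single click now appears three times in the output. When the same smearing happens to every sound in a spoken command, neighbouring sounds overlap, which is why a recogniser can mishear it.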

Image credits - Prof. Preeti Rao

An array of microphones can counter the effects of noise and reverberation.

“One of the cues humans use to eliminate noise is directionality. Using an array of microphones and analysing the signals received from them as a set can help improve signal quality greatly,” explains Prof Rao.

Together with her colleague, Prof. Rajbabu, her team has developed a system that can reduce reverberation as well as other environmental noise better than previously reported methods while retaining the signal’s information content.
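The article does not describe how the team's system works internally. As a general illustration of why several microphones help, here is a sketch of delay-and-sum, a textbook array technique, and not the team's method. It assumes the arrival delay at each microphone is already known and that the noise at each microphone is independent. The delays, noise level and function names are hypothetical.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Undo each channel's known arrival delay (in samples), then average."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(length)
    ]


def rms_error(estimate, target):
    """Root-mean-square difference between an estimate and the clean signal."""
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(estimate)
    )


rng = random.Random(0)  # fixed seed so the run is repeatable

# Clean source signal: a slow sine wave standing in for speech.
source = [math.sin(2 * math.pi * 5 * t / 500) for t in range(2000)]

# Hypothetical arrival delays (in samples) at eight microphones.
delays = [0, 3, 6, 9, 12, 15, 18, 21]

# Each microphone hears the delayed source plus its own random noise.
channels = []
for d in delays:
    delayed = [0.0] * d + source
    channels.append([x + rng.gauss(0.0, 0.5) for x in delayed])

enhanced = delay_and_sum(channels, delays)
single_mic_error = rms_error(channels[0], source)
array_error = rms_error(enhanced, source)
print(single_mic_error, array_error)
```

Once the channels are aligned, the speech adds up coherently while the independent noise partly cancels, so the averaged signal is closer to the clean source than any single microphone's recording. This toy model handles only additive noise; reducing reverberation needs more elaborate processing.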

Assessing Spoken Language Fluency

The proper use of stress and intonation in one’s speech makes it easy to understand what is communicated. An exciting work from Prof Rao’s group is an application called ‘Spoken Language Fluency Assessment’ that can help teachers assess a student’s understanding of a given text.

“We need large scale, objective methods to monitor literacy and teach languages. This is the motivation behind our work,” comments Prof Rao.

Image credits - Prof. Preeti Rao

The input to the application is an audio recording of a student reading a text passage appropriate for their grade. The application scores the reading on multiple rubrics obtained from human assessment protocols. It measures the accuracy of words, reading pace and expressiveness with reference to the provided text using speech recognition and models for stress and intonation. The system “learns” to emulate a human expert via training on thousands of annotations obtained from language teachers. The team is currently addressing challenges such as dealing with diversity in environments, accents and speaking skills.
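Two of the measures mentioned above, word accuracy and reading pace, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the application's scoring method. It assumes a speech recogniser has already produced a transcript of the recording, it uses Python's standard `difflib` module to align the words, and the function name `score_reading` and the sample sentences are hypothetical. Expressiveness, which needs models for stress and intonation, is not covered.

```python
from difflib import SequenceMatcher


def score_reading(reference_text, recognised_text, duration_seconds):
    """Score a reading against its reference passage.

    Returns the fraction of reference words read correctly and the
    number of correct words per minute.
    """
    reference = reference_text.lower().split()
    recognised = recognised_text.lower().split()

    # Align the two word sequences and count the words that match in order.
    matcher = SequenceMatcher(a=reference, b=recognised, autojunk=False)
    correct = sum(block.size for block in matcher.get_matching_blocks())

    return {
        "word_accuracy": correct / len(reference),
        "words_correct_per_minute": correct * 60.0 / duration_seconds,
    }


# Hypothetical example: the reader says "box" for "fox" and skips one "the".
result = score_reading(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown box jumps over lazy dog",
    duration_seconds=6.0,
)
print(result)
```

Here seven of the nine reference words are matched, giving a pace of 70 correct words per minute.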

Nuances of Hindustani Classical Music

Hindustani classical music is traditionally taught in the guru-shishya tradition, where students practise in the presence of their guru repeatedly until they get the notes and melodies right. Ragas, or melodic sequences, performed in Hindustani classical music are defined by raga grammar: the use of specific notes, note sequences and characteristic ornamentation.

“Typically, the theory is neither explicitly taught nor is it adequate for delivering a creative performance that adheres to the raga grammar and aesthetics,” remarks Prof Rao.

Prof Rao and her students developed a computational representation for raga grammar that can be applied to analyse audio recordings of performances. It could be useful for musicologists and as a teaching aid. Further, the analyses of performance recordings on the dimensions of tempo and rhythm have helped to create rich annotations that can be used for search and navigation within music archives.

Prof Rao and her team are also working on determining the goodness of a tabla bol based on a recorded reference database. When further refined, it could help students determine if what they play while practising is correct or not.

“Practicing with a guru daily may not be possible for students in today’s times. Such tools can act as good complementary tools for self-learning,” remarks the award-winning Hindustani classical musician Mahesh Kale.

Another of Prof Rao’s works features a melody-based classification of musical styles. Based on the pitch and energy characteristics extracted from the audio signals, the software could identify the music as Hindustani, Carnatic or Turkish while providing interesting insights on what exactly binds or separates these distinct traditions.
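The article does not say how pitch and energy are extracted. One common way to estimate pitch from a short stretch of audio is autocorrelation: find the shift at which the signal best matches itself. The Python sketch below shows that approach, along with a simple energy measure. The function names, the search range and the test tone are illustrative assumptions, not the software described above.

```python
import math


def frame_energy(frame):
    """Average power of the frame."""
    return sum(x * x for x in frame) / len(frame)


def estimate_pitch(frame, sample_rate, min_hz=80.0, max_hz=500.0):
    """Estimate pitch as the lag at which the frame best matches itself."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: sum(
            frame[i] * frame[i + lag] for i in range(len(frame) - lag)
        ),
    )
    return sample_rate / best_lag


SAMPLE_RATE = 8000  # samples per second (illustrative value)

# A steady 200 Hz tone standing in for a sung note, 0.1 seconds long.
frame = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(800)]

pitch = estimate_pitch(frame, SAMPLE_RATE)
energy = frame_energy(frame)
print(pitch, energy)
```

Tracking such values frame by frame over a recording gives a pitch contour and an energy contour, the kind of raw material from which melodic features can be derived.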

Input from experts from diverse backgrounds is crucial to Prof Rao’s work.

“Digital processing is just the mathematical processing of signals. We need inputs of experts in the field to make it useful. For example, it is essential to build a reference database that holds tabla bols, perceived as good or bad, by experts,” she notes. “We are not finding anything that musicians don’t know, we are just making it explicit,” she remarks.

Prof Rao continues to work on projects related to speech processing and music information retrieval and is also collaborating with several international researchers to use digital audio processing for multiple new applications in speech and music.

This article has been run past the researchers, whose work is covered, to ensure accuracy.