VoW! A technology to warp voice

A team of researchers from the Indian Institute of Science (IISc) has come up with a new technology that can ensure fine synchronisation between lip movements and the dubbed audio in movie voice-dubbing applications. Called ‘Voice Warping’ (VoW for short), the technology can do many other things as well: from helping you learn a new language or take Carnatic music lessons, to making radio/television advertisements cheaper. It was developed by Prof. Chandra Sekhar Seelamantula, who heads the Spectrum Lab at the Department of Electrical Engineering, IISc, Bangalore.

“A highlight of the technology is that the quality of the output is superior to the competing techniques known to scientists working in the area. It is language independent and can be readily integrated into most standard media players such as Windows Media Player or QuickTime Player, voice-over-IP (VoIP) applications such as Skype, Vqube, and Google Talk, text-to-speech synthesis (TTS) applications such as Siri on the iPhone, and tablet-based dictation tools,” says Prof. Chandra Sekhar Seelamantula.

In a nutshell, the team has developed a new technique to alter the pace of speech or vocals without affecting quality, richness, or speaker identity. With some minor modifications, the same technique can also be used for voice morphing (adult to child, male to female and vice versa, etc.) without changing the pace.
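The specifics of the IISc technique are not published in this article, but the general idea of changing the pace of speech without resampling can be illustrated with a classical baseline: overlap-add (OLA) time-scale modification, where short analysis frames are re-spaced before being summed back together. The sketch below (function name and parameters are my own, for illustration only) slows a test tone to roughly half speed:

```python
import numpy as np

def time_stretch_ola(signal, rate, frame_len=1024, hop=256):
    """Stretch `signal` by `rate` using overlap-add (OLA).

    A classical baseline for time-scale modification, shown only to
    illustrate the general idea: analysis frames taken every `hop`
    samples are written out every `hop / rate` samples, so the output
    plays slower (rate < 1) or faster (rate > 1) at the same sample
    rate. Assumes len(signal) >= frame_len.
    """
    window = np.hanning(frame_len)
    out_hop = int(round(hop / rate))                # synthesis hop
    n_frames = (len(signal) - frame_len) // hop + 1
    out = np.zeros(frame_len + (n_frames - 1) * out_hop)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        out[i * out_hop : i * out_hop + frame_len] += frame
        norm[i * out_hop : i * out_hop + frame_len] += window
    norm[norm < 1e-8] = 1.0                         # avoid divide-by-zero
    return out / norm                               # undo window gain

# Example: slow a 1-second 440 Hz tone to roughly half speed.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
slow = time_stretch_ola(tone, rate=0.5)
```

Plain OLA can introduce phase artefacts on real speech; production-quality systems (and, presumably, VoW) use more sophisticated synchronisation between frames, which is precisely where quality differences between techniques arise.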

The human voice is the result of an impeccable symphony among different organs in the body. It is an intricate process that humans have perfected over several thousands of years. But, we rarely stop and ask how we are capable of doing this almost automatically.

Every word we speak, every song we sing, and every laugh we share with friends starts off as a train of puffs of air released by the lungs. These packets of air, on their way up, pass through a tiny organ called the larynx, or voice box, and cause it to vibrate rapidly, a few hundred times per second. The vibrations produce a feeble sound, which is strengthened by the resonances of the throat, nose, and mouth. The amplified sound is the voice we hear and communicate with. It is a process so complex that it took us evolutionary time scales to learn to communicate with sounds, speech, and language.
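This two-stage picture, a vibrating source whose feeble sound is shaped and amplified by the resonances of the throat, nose, and mouth, is the classical source-filter model of speech. A minimal numerical sketch (all rates and resonance values below are illustrative choices, not measurements): an impulse train standing in for the vocal-fold "puffs", passed through a single two-pole resonator standing in for one vocal-tract resonance:

```python
import numpy as np

sr = 8000                         # sample rate in Hz (chosen for this sketch)
f0 = 120                          # vocal-fold vibration rate (~male pitch)
n = sr                            # one second of samples

# The "puffs of air": an impulse train at the vocal-fold rate.
source = np.zeros(n)
source[:: sr // f0] = 1.0

# The "strengthening" by the vocal tract: a crude two-pole resonator
# tuned near a typical first formant (~700 Hz, ~100 Hz bandwidth).
# Difference equation: y[t] = x[t] + a1*y[t-1] + a2*y[t-2].
f_res, bw = 700.0, 100.0
r = np.exp(-np.pi * bw / sr)      # pole radius from bandwidth
a1 = 2 * r * np.cos(2 * np.pi * f_res / sr)
a2 = -r * r
voice = np.zeros(n)
for t in range(n):
    y1 = voice[t - 1] if t > 0 else 0.0
    y2 = voice[t - 2] if t > 1 else 0.0
    voice[t] = source[t] + a1 * y1 + a2 * y2
```

Each impulse now rings at the resonant frequency, so the flat source spectrum takes on a vowel-like shape; real speech uses several such resonances (formants) at once.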

“The new technique is based on a deeper understanding of how humans speak and sing with the help of their vocal apparatus, comprising mainly the lungs, glottis, vocal folds, vocal tract, and the oral and nasal cavities,” explains Prof. Chandra Sekhar Seelamantula. His technology is based on digital signal processing, which is at the heart of most electronic gadgets we use in our daily lives.

Prof. Chandra imagines a lot of applications for his invention. “The applications are numerous,” he quips. “It can help children and people with speech disabilities articulate properly and pick up the correct pronunciation of a word.” The technique also gives the elderly and the hearing-impaired the choice to ‘slow down’ normal speech and perceive the details.

“In the radio/television advertising industry, where advertisers pay by the second, the technology is useful for fitting a normally spoken advertisement exactly into a given time slot (remember the high-speed ‘mutual fund investments’ disclaimer). Currently, many takes are required to fit an advertisement into a given time slot and to achieve lip synchronisation. The technology greatly reduces the effort required to meet this objective,” he says.

“In the future, we plan to develop VoW apps for smartphones so that the technology becomes accessible to a wider section of consumers. We also plan to work with industry to customise our technology for specific applications,” he adds.

Contact information:

For demonstrations, licensing rights, or consultation, contact Prof. Chandra Sekhar Seelamantula at chandra.sekhar@ee.iisc.ernet.in (or) chandra.sekhar@ieee.org.