Speech is produced using the lungs, vocal tract, and mouth. The amount of air from the lungs affects the loudness or volume. The vibration of the vocal cords determines whether the voice is high or low, which is also called the pitch or frequency. The teeth, tongue, and lips can create turbulence in the air flow. In general, the vocal cords and mouth shape produce the vowel sounds, while the teeth, tongue, and lips produce the consonant sounds. Some sounds, like "v", are a combination; "v" is an "f" with the vocal cords vibrating. If you place your fingers lightly on your throat, you can feel your vocal cords vibrate as you speak.
When we record speech, the microphone picks up vibrations from our voice, and records those as a signal. When we play the recorded speech, the signal creates vibrations in the speakers that recreate the voice vibrations. (Now we know why they are called speakers!) The same thing happens when we make a phone call. Our speech is converted to a signal, and the signal is transmitted to the other phone, where the speaker recreates the speech sounds from the signal.
What does the speech signal look like? Imagine plotting the displacement of a vibrating eardrum as a function of time. The biggest displacements occur during vowel sounds. The turbulence caused by consonants also shows up, but at a much smaller amplitude.
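We can sketch this picture numerically. The snippet below is a minimal illustration, not real recorded speech: it builds a synthetic "vowel" (a fundamental tone plus harmonics, like vibrating vocal cords) and a synthetic "consonant" (quiet random turbulence), then compares their amplitudes.

```python
import numpy as np

rate = 8000                      # samples per second
t = np.arange(0, 0.5, 1 / rate)  # half a second of time points

# "Vowel": a 150 Hz fundamental plus two harmonics --
# a large, periodic displacement
vowel = (np.sin(2 * np.pi * 150 * t)
         + 0.5 * np.sin(2 * np.pi * 300 * t)
         + 0.25 * np.sin(2 * np.pi * 450 * t))

# "Consonant": random turbulence at a much smaller amplitude
rng = np.random.default_rng(0)
consonant = 0.1 * rng.standard_normal(t.size)

# Root-mean-square amplitude of each segment
rms_vowel = np.sqrt(np.mean(vowel ** 2))
rms_consonant = np.sqrt(np.mean(consonant ** 2))
print(rms_vowel, rms_consonant)  # the vowel is much louder
```

Plotting `vowel` and `consonant` against `t` would show exactly the shape described above: big regular swings for the vowel, a small ragged trace for the consonant.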
Just as the human brain can interpret the sounds, a computer can process the speech signal, and recognize different sounds. This can be tricky, because no two people pronounce words exactly the same way. ("You say po-tay-to, I say po-tah-to..."). If the computer program only has to tell the difference between a few commands, the task is easier. Today many businesses use automated phone answering systems, where the caller selects a menu option.
The more words the computer program has to interpret, the harder the problem is. Did you know there are over 1 million words in the English language? Think of the number of different languages and all of the different dialects. And, with over 6 billion people in the world having different lungs, vocal cords, and mouths, each person will sound different.
Amazingly, there are computer programs that can convert speech to text, but they do make mistakes, which have to be corrected. Today we will experiment with some software called Dragon NaturallySpeaking, made by the company Nuance. The software understands commands like "New Line," and also has commands for correcting mistakes. We will not try to correct mistakes today, but if a single person uses the software and uses Dragon commands to make corrections, the software learns that speaker's pronunciation and becomes more accurate over time.
This leads us to another possibility. If each person has different pronunciation, then it should be possible for a computer program to tell the difference between different speakers. This is called speaker recognition (as opposed to speech recognition). You may have heard of voiceprints on Star Trek. One possibility is to have a person speak a pass phrase to enter an area. This is mostly for movies, because there are other options that are more reliable for access control.
Speaker recognition is more commonly used by the government and legal system. For instance, if a prisoner is under house arrest, they may receive phone calls periodically to verify that they have not left the house; a computer can verify their voice. Another use for speaker recognition is forensics. Remember that forensics refers to scientific evidence used in court. If evidence includes a recording of speech, it may be important to scientifically verify who was speaking. Speaker recognition is also useful for wiretaps and electronic eavesdropping; a computer program can be used to find speech of interest, or speech that may match a particular speaker. You can probably guess that speaker recognition in general is a very hard problem, but when we can narrow down the number of possible speakers, it can be useful.
We can compute the frequency of a part of the signal, which is one measurement that is useful for comparing speakers. With enough math and processing, a computer can recognize a speaker from the speech signal, just as the human brain can recognize voices.
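As a small illustration of that first step, the sketch below estimates the frequency of a signal segment by finding the strongest peak in its Fourier spectrum. It uses a clean synthetic tone as a stand-in for a voiced speech segment; real speech would be messier, but the idea is the same.

```python
import numpy as np

rate = 8000                      # samples per second
t = np.arange(0, 1.0, 1 / rate)  # one second of samples

# Stand-in for a voiced speech segment: a pure 220 Hz tone
segment = np.sin(2 * np.pi * 220 * t)

# Magnitude spectrum via the real-input FFT
spectrum = np.abs(np.fft.rfft(segment))
freqs = np.fft.rfftfreq(segment.size, d=1 / rate)

# The strongest frequency component is our pitch estimate
pitch = freqs[np.argmax(spectrum)]
print(pitch)  # 220.0
```

A speaker-comparison system would repeat this kind of measurement over many short segments of speech and combine the results, since one person's pitch varies from moment to moment.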