Speech
Essay by 24 • March 5, 2011 • 2,711 Words (11 Pages) • 1,073 Views
A Speech Recognition Project
Abstract
Voice recognition is a fascinating field spanning several areas of computer science and mathematics. Reliable speech recognition is a hard problem that requires a combination of many techniques; however, modern methods have achieved an impressive degree of accuracy. This project examines those techniques and applies them to build a simple voice recognition system. The project was started with three goals in mind: first, to distinguish 'yes' from 'no'; second, to recognize a vocabulary of 20 words, spoken individually; and third, to recognize combinations of two or more words from this vocabulary spoken in close succession. The project is implemented in Matlab and was successful in achieving the first goal. It has been able to differentiate between a spoken 'yes' and a spoken 'no' with 100% accuracy among 24 samples taken from 8 different people. The method used is a simple one, based on a count of zero crossings, but it is quite applicable to the voice recognition problem in general.
The Basic Steps
The process of voice recognition is typically divided into several well-defined steps. Different systems vary in the nature of these steps, as well as in how each step is implemented, but the most successful systems follow a similar methodology:
1. Divide the sound wave into evenly spaced blocks.
2. Process each block for important characteristics, such as strength across various frequency ranges, number of zero crossings, and total energy.
3. Using this characteristic vector, attempt to associate each block with a phone, which is the most basic unit of speech, producing a string of phones.
4. Find the word whose model is the most likely match to the string of phones which was produced.
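As a rough illustration of the first two steps, the following Python sketch splits a signal into blocks and computes a small characteristic vector for each block: total energy plus the strength in a few frequency bands. It is not the project's Matlab code. The function names, the non-overlapping blocks, the four equal-width bands, and the direct (slow) discrete Fourier transform are all assumptions chosen for readability.

```python
import cmath
import math


def split_into_blocks(samples, block_size):
    """Step 1: cut the signal into evenly spaced, non-overlapping blocks.

    A trailing partial block (shorter than block_size) is dropped.
    """
    return [samples[i:i + block_size]
            for i in range(0, len(samples) - block_size + 1, block_size)]


def total_energy(block):
    """Sum of squared sample values in the block."""
    return sum(x * x for x in block)


def band_strengths(block, n_bands):
    """Power in n_bands equal-width frequency bands below the Nyquist limit.

    Uses a direct O(n^2) DFT for clarity; a real system would use an FFT.
    Bins left over when the bin count is not divisible by n_bands are ignored.
    """
    n = len(block)
    half = n // 2
    power = []
    for k in range(half):
        coeff = sum(block[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        power.append(abs(coeff) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


def feature_vector(block, n_bands=4):
    """Step 2: [total energy, band 0 power, ..., band n_bands-1 power]."""
    return [total_energy(block)] + band_strengths(block, n_bands)
```

A caller would apply `feature_vector` to every block returned by `split_into_blocks`. The zero crossing count, which this project relies on, would be one more entry in the same vector.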
Step 2 typically involves performing a spectrum analysis of the block. This can be done with a Fast Fourier Transform (FFT), or with a bank of frequency filters, but the most successful technique to date has been that of Linear Predictive Coding. Additional important features include the total energy, the change in the features over time, and the number of zero crossings. Step 3 is often done via a decision tree. Each phone often has very prominent characteristics which narrow the field of consideration. Additional characteristics then separate similar-sounding phones. The final decisions are often mistaken, and these mistakes must be accounted for later. Step 4 has been implemented with a high degree of success using Hidden Markov Models (HMMs). An HMM is constructed for each word in the vocabulary, and then the string of phones is compared against each HMM to determine which model is the most likely match.
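To make step 4 more concrete, here is a minimal Python sketch of scoring a phone string against one HMM per word using the forward algorithm. This project does not implement this step, so everything here is hypothetical: the left-to-right model shape, the `stay` and `hit` probabilities, and the function names. The value 46 is simply the number of phones in the list given in this document. Any phone other than the one a state expects is emitted with a small `miss` probability, which is one simple way to tolerate the mistaken phone decisions mentioned above.

```python
N_PHONES = 46  # size of the phone list in this document


def left_to_right_model(phones, stay=0.5):
    """Hypothetical word model: one state per phone; each state either
    repeats (probability `stay`) or advances to the next state."""
    last = len(phones) - 1
    start = {i: (1.0 if i == 0 else 0.0) for i in range(len(phones))}
    trans = {i: ({i: stay, i + 1: 1.0 - stay} if i < last else {i: 1.0})
             for i in range(len(phones))}
    expected = dict(enumerate(phones))  # state index -> phone it expects
    return start, trans, expected


def forward_likelihood(observed, model, hit=0.8):
    """P(observed phone string | model), summed over all state paths.

    Simplification: the path may end in any state, not only the last one.
    """
    if not observed:
        raise ValueError("observed phone string is empty")
    start, trans, expected = model
    miss = (1.0 - hit) / (N_PHONES - 1)  # spread the rest over other phones

    def emit(state, phone):
        return hit if expected[state] == phone else miss

    alpha = {s: start[s] * emit(s, observed[0]) for s in start}
    for phone in observed[1:]:
        alpha = {s2: sum(alpha[s] * trans[s].get(s2, 0.0) for s in alpha)
                     * emit(s2, phone)
                 for s2 in start}
    return sum(alpha.values())


def recognize(observed, models):
    """Return the word whose model gives the observed phones the highest likelihood."""
    return max(models, key=lambda w: forward_likelihood(observed, models[w]))


# Hypothetical two-word vocabulary built from phones in the list.
MODELS = {
    "yes": left_to_right_model(["Y", "EH", "S"]),
    "no": left_to_right_model(["N", "OW"]),
}
```

For example, `recognize(["Y", "IH", "S"], MODELS)` still favors the 'yes' model even though the vowel was misidentified as IH, because two of the three phones match that model and none match the 'no' model. A practical system would learn the probabilities from training data and work with log probabilities to avoid numerical underflow on long phone strings.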
This project implements steps 1 and 2. In step 2 the program extracts the zero crossing count. The maximum count over all blocks is then taken, which is sufficient to detect the presence or absence of an unvoiced consonant. Because 'yes' contains the unvoiced consonant 's' and 'no' does not contain an unvoiced consonant, this feature can distinguish between 'yes' and 'no' with a high degree of accuracy. See zerocross.m for the algorithm used to extract the zero crossing count in a given block.
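The idea can be sketched in a few lines of Python. This is not the contents of zerocross.m, which is not reproduced here. The function names, the block size, and the decision threshold are assumptions; in practice the threshold would be chosen from recorded samples.

```python
def zero_crossings(block):
    """Count sign changes between consecutive samples in one block."""
    count = 0
    for prev, cur in zip(block, block[1:]):
        if (prev < 0) != (cur < 0):
            count += 1
    return count


def max_zero_crossings(samples, block_size):
    """Largest zero crossing count over all full, non-overlapping blocks.

    Returns 0 if the signal is shorter than one block.
    """
    blocks = [samples[i:i + block_size]
              for i in range(0, len(samples) - block_size + 1, block_size)]
    return max((zero_crossings(b) for b in blocks), default=0)


def classify_yes_no(samples, block_size, threshold):
    """Guess 'yes' if some block crosses zero more than `threshold` times.

    A high count suggests an unvoiced consonant such as the 's' in 'yes';
    `threshold` is an assumed tuning parameter, not a value from the project.
    """
    if max_zero_crossings(samples, block_size) > threshold:
        return "yes"
    return "no"
```

Taking the maximum rather than the average matters here: the 's' occupies only a few blocks at the end of the word, so an average over the whole utterance would dilute the high count that signals it.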
A List of Phones
Phone Example
Vowels
IY beat
IH bit
EY bait
EH bet
AE bat
AA Bob
AH but
AO bought
OW boat
UH book
AX about
IX roses
ER bird
AXR butter
AW down
AY buy
OY boy
Consonants
Y you
W wit
R rent
L let
M met
N net
NX sing
P pet
T ten
K kit
B bet
D debt
G get
HH hat
F fat
TH thing
S sat
SH shut
V vat
DH that
Z zoo
ZH azure
CH church
JH judge
WH which
EL battle
EM bottom
EN button
DX batter
The Experiment
Numerous samples were taken of various people saying either 'yes' or 'no'. This method is somewhat artificial in that a real system first has to detect whether speech exists at all (this is the separate task of speech detection). Therefore I implemented a criterion for the detection of speech (explained later). This turned out to be useful for removing the empty header and trailer on each voice sample. The method is also somewhat artificial because in fluent speech words tend to run together and the word boundaries are not obvious. I intended to address this problem
...
...