

Essay by   •  March 5, 2011  •  2,711 Words (11 Pages)  •  1,125 Views


A Speech Recognition Project


Voice recognition is a fascinating field spanning several areas of computer science and mathematics. Reliable speech recognition is a hard problem that requires a combination of many techniques; however, modern methods have been able to achieve an impressive degree of accuracy. This project attempts to examine those techniques and to apply them to build a simple voice recognition system. The project was started with three goals in mind: first, to be able to distinguish 'yes' from 'no'; second, to be able to recognize a vocabulary of 20 words, spoken individually; and third, to be able to recognize combinations of two or more words from this vocabulary spoken in close succession. The project is implemented in Matlab and was successful in achieving the first goal. It has been able to differentiate between a spoken 'yes' and a spoken 'no' with 100% accuracy among 24 samples taken from 8 different people. The method used is a simple one, based on counting zero crossings, but it is quite applicable to the voice recognition problem in general.

The Basic Steps

The process of voice recognition is typically divided into several well-defined steps. Different systems vary in the nature of these steps, as well as in how each step is implemented, but the most successful systems follow a similar methodology:

1. Divide the sound wave into evenly spaced blocks.

2. Process each block for important characteristics, such as strength across various frequency ranges, number of zero crossings, and total energy.

3. Using this characteristic vector, attempt to associate each block with a phone, which is the most basic unit of speech, producing a string of phones.

4. Find the word whose model is the most likely match to the string of phones which was produced.
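The first two steps above can be illustrated with a short sketch. The project itself is written in Matlab; the Python below is only an illustration, and the function names, the block size of 200 samples, the 8000 Hz sample rate, and the synthetic sine-wave input are all assumptions made for the example rather than details taken from the project.

```python
import math

def split_into_blocks(samples, block_size):
    """Step 1: cut the signal into consecutive, equally sized blocks.
    A trailing partial block is dropped (an assumption of this sketch)."""
    n_blocks = len(samples) // block_size
    return [samples[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

def zero_crossings(block):
    """Count sign changes between neighbouring samples in one block."""
    return sum(1 for a, b in zip(block, block[1:]) if (a >= 0) != (b >= 0))

def total_energy(block):
    """Sum of squared sample values in one block."""
    return sum(x * x for x in block)

def block_features(samples, block_size):
    """Step 2: one (zero crossings, energy) feature pair per block."""
    return [(zero_crossings(b), total_energy(b))
            for b in split_into_blocks(samples, block_size)]

# Hypothetical input: a 100 Hz sine sampled at 8000 Hz for 0.1 s,
# with a small phase offset so no sample lands exactly on zero.
rate = 8000
signal = [math.sin(2 * math.pi * 100 * n / rate + 0.5) for n in range(800)]
features = block_features(signal, 200)
for crossings, energy in features:
    print(crossings, round(energy, 3))
```

A real front end would add spectral features to each vector as well; only the two cheapest features are shown here.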

Step 2 typically involves performing a spectrum analysis of the block. This can be done with a Fast Fourier Transform (FFT) or with a bank of frequency filters, but the most successful technique to date has been Linear Predictive Coding. Additional important features include the total energy, the change in the features over time, and the number of zero crossings. Step 3 is often done via a decision tree. Each phone often has very prominent characteristics which narrow the field of consideration. Additional characteristics then separate similar-sounding phones. The final decisions are often mistaken, and these mistakes must be accounted for later. Step 4 has been implemented with a high degree of success using Hidden Markov Models (HMMs). An HMM is constructed for each word in the vocabulary, and the string of phones is then compared against each HMM to determine which model is the most likely match.
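To make step 4 more concrete, the sketch below scores a string of phones against one small discrete HMM per word using the forward algorithm and picks the word with the highest likelihood. This project does not implement this step, so everything here is hypothetical: the two toy models, their transition and emission probabilities, and the floor probability assigned to unlisted phones are invented purely for illustration.

```python
def forward_likelihood(phones, model):
    """P(phones | model) for a discrete HMM, via the forward algorithm.
    Assumes a non-empty phone string."""
    start, trans, emit = model["start"], model["trans"], model["emit"]
    n_states = len(start)
    floor = 1e-6  # assumed small probability for phones a state does not list
    alpha = [start[s] * emit[s].get(phones[0], floor) for s in range(n_states)]
    for phone in phones[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states))
            * emit[s].get(phone, floor)
            for s in range(n_states)
        ]
    return sum(alpha)

# Hypothetical left-to-right word models; all numbers are made up.
MODELS = {
    "yes": {
        "start": [1.0, 0.0, 0.0],
        "trans": [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],
        "emit": [{"Y": 0.8, "IY": 0.2}, {"EH": 0.7, "AE": 0.3}, {"S": 0.9, "Z": 0.1}],
    },
    "no": {
        "start": [1.0, 0.0],
        "trans": [[0.5, 0.5], [0.0, 1.0]],
        "emit": [{"N": 0.8, "M": 0.2}, {"OW": 0.8, "AO": 0.2}],
    },
}

def best_word(phones):
    """Return the vocabulary word whose model best explains the phones."""
    return max(MODELS, key=lambda word: forward_likelihood(phones, MODELS[word]))

print(best_word(["Y", "EH", "EH", "S", "S"]))
print(best_word(["N", "OW", "OW"]))
```

Because each state can emit more than one phone, a mistaken decision in step 3 (for example AE in place of EH) lowers the score of the correct word without ruling it out.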

This project implements steps 1 and 2. In step 2 the program extracts the zero crossing count. The maximum count over all blocks is then taken, which is sufficient to detect the presence or absence of an unvoiced consonant. Because 'yes' contains the unvoiced consonant 's' and 'no' does not contain an unvoiced consonant, this is able to distinguish between 'yes' and 'no' with a high degree of accuracy. See zerocross.m for the algorithm used to extract the zero crossing count in a given block.
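The decision rule described above can be sketched as follows. This is not the contents of zerocross.m, which is not reproduced here; it is a Python illustration in which the block size, the threshold of 40 crossings, and the synthetic stand-ins for voiced and unvoiced sound (a 150 Hz sine and seeded random noise) are assumptions chosen for the example.

```python
import math
import random

def zero_crossings(block):
    """Count sign changes between neighbouring samples in one block."""
    return sum(1 for a, b in zip(block, block[1:]) if (a >= 0) != (b >= 0))

def max_zero_crossings(samples, block_size):
    """Largest per-block zero crossing count over all full blocks."""
    counts = [zero_crossings(samples[i:i + block_size])
              for i in range(0, len(samples) - block_size + 1, block_size)]
    return max(counts, default=0)

def classify(samples, block_size=200, threshold=40):
    """'yes' if any block looks unvoiced (many crossings), else 'no'.
    The threshold is a hypothetical value, not one from the project."""
    return "yes" if max_zero_crossings(samples, block_size) > threshold else "no"

rate = 8000
rng = random.Random(0)  # seeded so the example is repeatable

def voiced(n):
    """Stand-in for a vowel: a 150 Hz sine, few crossings per block."""
    return [math.sin(2 * math.pi * 150 * k / rate) for k in range(n)]

def unvoiced(n):
    """Stand-in for 's': noise, which crosses zero very often."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

fake_no = voiced(2400)
fake_yes = voiced(1600) + unvoiced(800)
print(classify(fake_no), classify(fake_yes))
```

Taking the maximum rather than the average means a short burst of 's' anywhere in the sample is enough to trigger the 'yes' decision.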

A List of Phones

Phone  Example

IY     beat
IH     bit
EY     bait
EH     bet
AE     bat
AA     Bob
AH     but
AO     bought
OW     boat
UH     book
AX     about
IX     roses
ER     bird
AXR    butter
AW     down
AY     buy
OY     boy

Y      you
W      wit
R      rent
L      let
M      met
N      net
NX     sing
P      pet
T      ten
K      kit
B      bet
D      debt
G      get
HH     hat
F      fat
TH     thing
S      sat
SH     shut
V      vat
DH     that
Z      zoo
ZH     azure
CH     church
JH     judge
WH     which
EL     battle
EM     bottom
EN     button
DX     batter

The Experiment

Numerous samples were taken of various people saying either 'yes' or 'no'. This method is somewhat artificial in that a real system first has to detect whether speech exists at all (the separate task of speech detection). I therefore implemented a criterion for detecting speech (explained later), which turned out to be useful for removing the empty header and trailer on each voice sample. The method is also somewhat artificial because in fluent speech words tend to run together and the word boundaries are not obvious. I intended to address this problem


