High-Level Speaker Verification via Articulatory-Feature Based Sequence Kernels and SVM
Essay by 24 • November 26, 2010 • 2,689 Words (11 Pages) • 1,399 Views
\begin{abstract}\vspace{-0.06cm}
Articulatory-feature based pronunciation models (AFCPMs) are capable of
capturing the pronunciation variations among different speakers and are good
for high-level speaker recognition. However, the likelihood-ratio scoring
method of AFCPMs is based on a decision boundary created by training the target
speaker model and universal background model (UBM) separately. Therefore, the
method does not fully utilize the discriminative information available in the
training data. To fully harness the discriminative information, this paper
proposes training a support vector machine (SVM) for computing the verification
scores. More precisely, the models of target speakers, individual background
speakers, and claimants are converted to AF-supervectors, which form the inputs
to an AF-based kernel of the SVM for computing verification scores. Results
show that the proposed AF-kernel scoring is complementary to likelihood-ratio
scoring, leading to better performance when the two scoring methods are
combined. Further performance enhancement was also observed when the AF scores
were combined with acoustic scores derived from a GMM-UBM system.
%However, to represent the impostor population, the likelihood-ratio scoring
%method of AFCPMs only uses a single universal background model (UBM) that is
%trained without considering the target speakers; therefore this scoring method
%does not fully utilize the discriminative information available in the training
%data.
\end{abstract}
%\noindent{\bf Index Terms}: Speaker verification, kernels, articulatory
%features, pronunciation models, SVM \vspace{-0.1cm}
\section{Introduction}\label{sec:intro}%\vspace{-0.1cm}
Studies have shown that combining low-level acoustic information with
high-level speaker information---such as the usage or duration of particular
words, prosodic features and articulatory features (AF)---can improve speaker
verification performance
\cite{Reynolds&Andrew03,Campbell&Reynolds03,Klusacek03,Leung&Mak&Kung06,Zhang&Mak&Meng07}.
However, in most systems (e.g., GMM-UBM \cite{Reynolds&Quatieri&Dunn00} and
CD-AFCPM \cite{Zhang&Mak&Meng07}), scoring is done at the frame level, i.e.,
each frame of speech is scored separately and then frame-based scores are
accumulated to produce an utterance-based score for classification. This
frame-based scoring scheme has two drawbacks. First, treating the frames
individually may not be able to fully capture the sequence information
contained in the utterance. Second, the goal of speaker verification is to
minimize classification errors on test utterances rather than on individual
speech frames. These drawbacks motivate us to derive a sequence-based approach
in which an utterance is treated as a sequence of symbols and the
utterance-based score is obtained from a support vector machine (SVM)
through a kernel function of the sequence of symbols.
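
% Illustrative sketch only: the symbols X, T, x_t, \Lambda_s and \Lambda_b are
% assumed notation for this example and are not defined in the original text.
To make the frame-based scheme concrete, a generic form of the accumulated
score can be sketched as follows. Here $X=\{x_1,\ldots,x_T\}$ denotes the $T$
frames of a test utterance, and $\Lambda_s$ and $\Lambda_b$ denote the target
speaker model and the background model; this notation is assumed for
illustration, and the exact frame scores and normalization used by the cited
systems may differ.
\[
  S_{\mathrm{LR}}(X) \;=\; \frac{1}{T}\sum_{t=1}^{T}
  \Bigl[\log p(x_t \mid \Lambda_s) - \log p(x_t \mid \Lambda_b)\Bigr]
\]
Each term of the sum depends on a single frame only, so the utterance-level
score is unchanged when the frames are reordered, which illustrates the first
drawback noted above.
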
This paper derives an articulatory-feature based sequence kernel and applies it
to high-level speaker verification. For each target speaker, the observation
sequences (AF labels) derived from his/her utterances are used to train a
phonetic-class dependent articulatory feature-based pronunciation model
(CD-AFCPM) \cite{Zhang&Mak&Meng07}. These models are then converted to
fixed-dimension AF supervectors for training a speaker-dependent SVM to
discriminate the target speaker from background speakers in the AF-supervector
space. To enhance the discrimination, a kernel that computes the similarity
between the target speaker's supervector and the claimant's supervector is
derived for the SVM. During verification, the AF labels derived from the speech
of a claimant are used to build a CD-AFCPM of the claimant, which together with
the target speaker model form the inputs to the speaker-dependent SVM to
compute the verification scores. Because the kernel depends on the AF models of
both the target speaker and the background speakers, we refer to it as the
AF-kernel.
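
% Illustrative sketch only: \mathbf{a}_c, \mathbf{a}_i, \alpha_i, y_i, b and K
% are assumed notation; the specific AF-kernel is not reproduced here.
As a rough sketch of the utterance-level alternative, the output of a
speaker-dependent SVM takes the standard form below. The symbols are assumed
for illustration: $\mathbf{a}_c$ is the claimant's AF supervector,
$\mathbf{a}_i$ are the support vectors drawn from the target and background
supervectors, $y_i\in\{+1,-1\}$ labels them as target or background,
$\alpha_i\ge 0$ and $b$ are the trained SVM parameters, and $K$ stands for the
kernel.
\[
  f(\mathbf{a}_c) \;=\; \sum_{i} \alpha_i\, y_i\, K(\mathbf{a}_c,\mathbf{a}_i) + b
\]
In contrast to frame-level accumulation, this score is computed once per
utterance from fixed-dimension supervectors, and its parameters are trained
to separate the target speaker from the background speakers.
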
The remainder of the paper derives the AF-kernel and discusses the
relationship between traditional frame-based log-likelihood-ratio (LR) scoring and
AF-kernel based SVM scoring. Experimental results on the NIST2000 database are
also presented.
\section{Phonetic-Class Dependent AFCPM}
\subsection{Articulatory-Feature Based Supervectors}\label{sec:AF_and_AFSuperVector}
Articulatory features (AFs) are representations describing the
movements or positions
...
...