
I have developed a proof-of-concept system for sound recognition using MFCC features and hidden Markov models. It gives promising results when I test the system on known sounds. However, when an unknown sound is input, the system returns the closest match, and the score is not distinct enough to tell that it is an unknown sound. For example:

I have trained 3 hidden Markov models: one for speech, one for water coming out of a water tap, and one for knocking on a desk. Then I test them on unseen data and get the following results:

input: speech
HMM\knocking:  -1213.8911146444477
HMM\speech:  -617.8735676792728
HMM\watertap:  -1504.4735097322673

So the highest score is speech, which is correct.

input: watertap
HMM\knocking:  -3715.7246152783955
HMM\speech:  -4302.67960438553
HMM\watertap:  -1965.6149147201534

So the highest score is watertap, which is correct.

input: knocking
HMM\filler:  -806.7248912250212
HMM\knocking:  -756.4428782636676
HMM\speech:  -1201.686687761133
HMM\watertap:  -3025.181144273698

So the highest score is knocking, which is correct.

input: unknown
HMM\knocking:  -4369.1702184688975
HMM\speech:  -5090.37122832872
HMM\watertap:  -7717.501505674925

Here the input is an unknown sound, but the system still returns the closest match, as there is no thresholding or garbage filtering.
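The failure mode above can be shown in a few lines (a minimal sketch; the scores are the log-likelihoods from the "unknown" example):

```python
# Log-likelihood scores from the "unknown" example above.
scores = {
    "knocking": -4369.17,
    "speech": -5090.37,
    "watertap": -7717.50,
}

# A pure argmax over model scores always returns the closest match,
# even for an out-of-vocabulary sound -- there is no way to answer "unknown".
best = max(scores, key=scores.get)
print(best)  # knocking
```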

I know that in keyword spotting an OOV (out-of-vocabulary) sound can be filtered out using a garbage or filler model, but the literature says such a model is trained on a finite set of unknown words. That can't be applied to my system, as I don't know all the sounds the system may record.

How is a similar problem solved in speech recognition systems? And how can I solve my problem to avoid false positives?

Radek
  • I think this should be moved to Cross Validated. – ziggystar Jun 22 '12 at 13:42
  • I agree this would get more (and better qualified) attention on Cross Validated. Sadly, the bat signal (aka "enough eyeballs with high enough permissions") seems to be turned off, so Radek would have to put it there. (The "belongs on" doesn't have an option for CV or manually specifying where it belongs. Meh.) – Godeke Jun 22 '12 at 15:46

3 Answers


To reject other words you need a filler model.

This is a statistical hypothesis test. You have two hypotheses (the word is known and the word is unknown). To make a decision you need to estimate the probability of each hypothesis.

The filler model is trained from the speech you already have, just in a different way; for example, it might be a single Gaussian covering any speech sound. You compare the score from the generic filler model with the score from the word HMM and make a decision. For more in-depth information and advanced algorithms you can check any paper on keyword spotting. This thesis has a good review:

Acoustic Keyword Spotting in Speech with Applications to Data Mining – A. J. Kishan Thambiratnam

http://eprints.qut.edu.au/37254/1/Albert_Thambiratnam_Thesis.pdf
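The single-Gaussian filler idea can be sketched roughly as below. This is a minimal sketch with synthetic frames standing in for real MFCC features, and the `accept` rule is one simple way to make the comparison, not the only one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MFCC frames (rows = frames, columns = cepstral coefficients);
# in a real system these come from a feature extractor.
pooled_frames = rng.normal(size=(500, 13))  # training frames from all classes
test_frames = rng.normal(size=(80, 13))     # frames of one test sound

# Filler model: one diagonal-covariance Gaussian fitted to the pooled frames.
mu = pooled_frames.mean(axis=0)
var = pooled_frames.var(axis=0) + 1e-6      # variance floor

def diag_gauss_loglik(frames, mu, var):
    """Total log-likelihood of frames under a diagonal Gaussian."""
    d = frames.shape[1]
    per_frame = -0.5 * ((frames - mu) ** 2 / var).sum(axis=1) \
                - 0.5 * (np.log(var).sum() + d * np.log(2.0 * np.pi))
    return per_frame.sum()

filler_score = diag_gauss_loglik(test_frames, mu, var)

def accept(word_scores, filler_score):
    """Accept the best word-HMM label only if it beats the filler score."""
    best = max(word_scores, key=word_scores.get)
    return best if word_scores[best] > filler_score else None
```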

Nikolay Shmyrev

So what I have done is create my own simplified version of a filler model. Each HMM representing the watertap, knocking and speech sounds is a separate 6-state HMM, trained on training sets of 30, 50 and 90 sounds respectively, of various lengths from 0.3 to 10 seconds. Then I created a filler model, which is a 1-state HMM trained on all the training-set sounds for knocking, watertap and speech together. If an HMM's score for a given sound is greater than the filler's score, the sound is recognized; otherwise it is treated as unknown. I don't really have a large data set, but I have performed the following test for false-positive rejection and true-positive rejection on unseen sounds.

True positives incorrectly rejected:
knocking: 1/11 rejected = 90% accuracy
watertap: 1/9 rejected = 89% accuracy
speech: 0/14 rejected = 100% accuracy

False positives correctly rejected:
Tested 7 unknown sounds
6/7 rejected = 86% accuracy

So from this quick test I can conclude that this approach gives reasonable results, although I have a strange feeling it may not be enough.
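The decision rule described above can be sketched as follows (a minimal sketch; the scores reuse the knocking example from the question, where a filler score was listed):

```python
def recognize(hmm_scores, filler_score):
    """Return the best-scoring class if it beats the filler model, else 'unknown'."""
    best = max(hmm_scores, key=hmm_scores.get)
    return best if hmm_scores[best] > filler_score else "unknown"

# Scores from the knocking example in the question:
hmm_scores = {"knocking": -756.44, "speech": -1201.69, "watertap": -3025.18}
print(recognize(hmm_scores, filler_score=-806.72))  # knocking beats the filler
```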

Radek

Discriminative models tend to perform better on classification tasks than generative models.

You could definitely get better performance on this task using a specially designed CRF or a max-margin classifier (structured SVM).

This paper (http://ttic.uchicago.edu/~jkeshet/papers/KeshetGrBe07.pdf) discusses a classification problem similar to yours and shows that a max-margin formulation outperforms the generative approach with filler model.

There is probably nothing out-of-the-box that can do what I have described, but with some effort you might be able to extend SVM-struct. (The HMM-SVM implementation won't work for your problem because it requires you to specify the hidden-state structure in advance rather than learn an arbitrarily connected one.)

user1149913