I have developed a proof-of-concept system for sound recognition using MFCCs and hidden Markov models. It gives promising results when I test it on known sounds. However, when an unknown sound is input, the system still returns the closest match, and the score is not distinct enough to tell that the sound is unknown. For example:
I have trained 3 hidden Markov models: one for speech, one for water running from a tap, and one for knocking on a desk. Then I test them on unseen data and get the following results:
input: speech
HMM\knocking: -1213.8911146444477
HMM\speech: -617.8735676792728
HMM\watertap: -1504.4735097322673
So the highest score is speech, which is correct.
input: watertap
HMM\knocking: -3715.7246152783955
HMM\speech: -4302.67960438553
HMM\watertap: -1965.6149147201534
So the highest score is watertap, which is correct.
input: knocking
HMM\filler: -806.7248912250212
HMM\knocking: -756.4428782636676
HMM\speech: -1201.686687761133
HMM\watertap: -3025.181144273698
So the highest score is knocking, which is correct.
input: unknown
HMM\knocking: -4369.1702184688975
HMM\speech: -5090.37122832872
HMM\watertap: -7717.501505674925
Here the input is an unknown sound, but the system still returns the closest match, as there is no thresholding/garbage filtering in place.
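For context, my classification step is essentially an argmax over the per-model log-likelihoods. A minimal sketch of it, assuming hmmlearn models and librosa for MFCC extraction (the helper names and the `models` dict are placeholders, not my exact code):

```python
import librosa

def extract_mfcc(path, n_mfcc=13):
    # Load the audio file and return MFCC frames shaped (n_frames, n_mfcc)
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def classify(models, path):
    # models: dict mapping label -> trained hmmlearn model, e.g.
    # {"speech": ..., "watertap": ..., "knocking": ...}
    feats = extract_mfcc(path)
    # score() returns the log-likelihood of the observation sequence
    scores = {label: m.score(feats) for label, m in models.items()}
    return max(scores, key=scores.get), scores
```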
I know that in keyword spotting an OOV (out-of-vocabulary) word can be filtered out using a garbage or filler model, but from what I have read, such a model is trained on a finite set of unknown words. That doesn't seem applicable to my system, since I can't enumerate all the sounds the system might record.
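My rough understanding is that such rejection amounts to a log-likelihood-ratio test against the filler score, something like the sketch below (continuing the sketch above; the margin is a made-up parameter I wouldn't know how to tune for arbitrary environmental sounds):

```python
def classify_with_rejection(models, filler, feats, margin=50.0):
    # Accept the best-scoring model only if its log-likelihood beats
    # the filler/garbage model's by some margin; otherwise report the
    # sound as unknown. The margin value here is a guess, not tuned.
    scores = {label: m.score(feats) for label, m in models.items()}
    best = max(scores, key=scores.get)
    if scores[best] - filler.score(feats) < margin:
        return "unknown", scores
    return best, scores
```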
How is a similar problem solved in speech recognition systems? And how can I solve my problem to avoid false positives?