
Recognition results are best when the sampling rate and bit depth of the audio match those of the system's training data.

So, does anyone know the exact sampling rate and/or bit depth (and/or stereo/mono) used by the Microsoft Speech Platform (the newest version, if that matters)? And if so, do you remember where you got this information?

Please note that I am using the MS Speech Platform, not SAPI. As far as I know, the two are not equivalent unless they use the same training data. To be precise, I use this: http://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.speechrecognitionengine.setinputtowavefile%28v=office.14%29.aspx

My first attempt is based on the C++ code example given on that page.

LittleBobbyTables - Au Revoir
Icarus

2 Answers


The Microsoft.Speech SR engine doesn't need training (unlike the System.Speech SR engine) and is relatively insensitive to sampling rate (it will work with anything > 8 kHz). 16-bit audio is preferred, but I believe it will also work with 8-bit audio.
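To see what a given file actually contains before handing it to the recognizer, something like the following could be used. This is only a sketch in Python using the standard `wave` module, not part of the Speech Platform; the names `describe_wav` and `meets_guidance` are made up for illustration, and treating exactly 8 kHz as acceptable is an assumption.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # getsampwidth() reports bytes per sample
        channels = wav.getnchannels()
    return rate, bits, channels


def meets_guidance(path, min_rate_hz=8000, preferred_bits=16):
    """Check a file against the rough guidance in this answer.

    The default thresholds are assumptions taken from the answer text,
    not documented limits of the recognizer.
    """
    rate, bits, _channels = describe_wav(path)
    return rate >= min_rate_hz and bits == preferred_bits
```

Calling `describe_wav("input.wav")` on a 16 kHz, 16-bit mono file would return `(16000, 16, 1)`.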

Community
Eric Brown
  • It may work with practically everything, but as far as I know, speech recognition systems work best when the input has the same sample rate and bit depth as their training data. To clarify: I don't intend to train the system; I'm trying to decide on the optimal format for the material to be recognized. – Icarus Aug 12 '13 at 07:46
  • Microsoft.Speech is built on top of 8 kHz, 16-bit audio. That being said, Microsoft.Speech is pretty insensitive to audio quality. – Eric Brown Sep 12 '13 at 03:42
  • We ran some tests on sample material; it seems the optimal setup for our purposes is 16 kHz, 16-bit. This surprises me a bit. How do you know what MS SAPI is built on? I did not find that information anywhere. Can you give a link? – Icarus Sep 12 '13 at 09:47
  • Oops. Yeah, that explains it. Thank you. – Icarus Sep 13 '13 at 07:15

I couldn't find any information regarding sample rate, but it seems the bit depth is actually 8-bit (maybe this has changed since Eric Brown's answer).

Quoted from this page listing supported audio formats:

The Speech Platform downsamples audio that is of greater than 8-bit resolution.

You should be fine providing audio at any bit depth that is a multiple of 8 bits (which is always the case anyway), since there will be no precision loss due to rounding (and, unlike with sample rate, there is no aliasing for resolution).
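To illustrate why a multiple of 8 bits keeps things simple: a 16-bit sample can be reduced to 8 bits by keeping only its high byte. The sketch below (Python, with a made-up function name `pcm16_to_pcm8`) assumes plain truncation and the usual WAV conventions (16-bit PCM is signed little-endian, 8-bit PCM is unsigned). How the Speech Platform actually performs the conversion is not stated on the quoted page, so take this as an assumption.

```python
import struct


def pcm16_to_pcm8(data: bytes) -> bytes:
    """Reduce signed 16-bit little-endian PCM to unsigned 8-bit PCM.

    Keeps the high byte of each sample (truncation) and shifts the
    result into the unsigned 0..255 range used by 8-bit WAV data.
    """
    count = len(data) // 2  # ignore a trailing odd byte, if any
    samples = struct.unpack("<%dh" % count, data[: count * 2])
    return bytes((sample >> 8) + 128 for sample in samples)
```

For example, silence (sample value 0) maps to 128, the midpoint of the unsigned 8-bit range.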

M57