Microsoft Speech Platform - sampling rate and bit depth

Question

Recognition results are best if sampling rate and bit depth of the audio match the training data of the system.

So, does anyone know the exact sampling rate and/or bit depth (and/or stereo/mono) that is used in Microsoft Speech Platform (newest, if that's important)? And if so, do you remember where you got this information?

Please note that I am using the MS Speech Platform, not the SAPI. Unless both are using the same training data, that's not the same AFAIK. To be precise - I use this: http://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.speechrecognitionengine.setinputtowavefile%28v=office.14%29.aspx

My first try is based upon the C++ code example given on the page.

score 0 · Accepted Answer · edited May 23 '17 at 11:56

The Microsoft.Speech SR engine doesn't need training (unlike the System.Speech SR engine) and is relatively insensitive to sampling rate (it will work with anything above 8 kHz). 16-bit audio is preferred, but I believe it will also work with 8-bit audio.
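
To check what a recording actually contains before handing it to the recognizer, the WAV header can be read directly. Below is a minimal sketch in Python using the standard `wave` module. The function names `describe_wav` and `format_warnings` are made up for illustration, and the thresholds (at least 8 kHz, 16-bit preferred) only restate the guidance in this answer, not a documented requirement of the Speech Platform.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def format_warnings(path, min_rate_hz=8000, preferred_bits=16):
    """List ways the file departs from the guidance above.

    The default thresholds are assumptions taken from this answer,
    not values published for the Speech Platform.
    """
    rate, bits, _channels = describe_wav(path)
    warnings = []
    if rate < min_rate_hz:
        warnings.append(f"sampling rate {rate} Hz is below {min_rate_hz} Hz")
    if bits != preferred_bits:
        warnings.append(f"bit depth is {bits}-bit, {preferred_bits}-bit is preferred")
    return warnings
```

Calling `format_warnings("speech.wav")` on a 16 kHz, 16-bit file should return an empty list; the file name here is only a placeholder.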

answered Aug 10 '13 at 16:35

Eric Brown

It may work with practically everything, but from what I know, speech recognition systems work best when used with the same sample rate and bit depth they were trained on. To clarify: I don't intend to train the system; I am trying to decide on the optimal format for the material that is to be recognized. – Icarus Aug 12 '13 at 07:46
Microsoft.Speech is built on top of 8 kHz, 16-bit audio. That being said, Microsoft.Speech is pretty insensitive to audio quality. – Eric Brown Sep 12 '13 at 03:42
We ran some tests on sample material, and it seems the optimal setup for our purposes is 16 kHz, 16-bit. This surprises me a bit. How do you know what MS SAPI is built on? I did not find that information anywhere. Can you give a link? – Icarus Sep 12 '13 at 09:47
Oops. Yeah, that explains. Thank you. – Icarus Sep 13 '13 at 07:15

score 0 · Answer 2 · answered Jan 03 '18 at 11:13

I couldn't find any information regarding sample rate, but it seems the bit depth is actually 8-bit (maybe this has changed since Eric Brown's answer).

Quoted from this page listing supported audio formats:

The Speech Platform downsamples audio that is of greater than 8-bit resolution.

You should be fine providing any bit depth that is a multiple of 8 bits (which is always the case anyway), since there will be no precision loss due to rounding (and there is no aliasing for resolution, unlike sample rate).
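
To make the bit-depth reduction concrete, here is a minimal Python sketch of converting 16-bit samples to 8-bit by keeping only the top 8 bits of each sample. It assumes plain truncation and the usual WAV conventions (signed little-endian for 16-bit, unsigned for 8-bit). How the Speech Platform performs this step internally is not stated on the quoted page, and the function name `pcm16_to_pcm8` is made up for illustration.

```python
import struct


def pcm16_to_pcm8(data: bytes) -> bytes:
    """Requantize signed 16-bit little-endian PCM to unsigned 8-bit PCM.

    Assumption: simple truncation (no dithering, no rounding).
    """
    count = len(data) // 2
    samples = struct.unpack("<%dh" % count, data[: count * 2])
    # Keep the top 8 bits (range -128..127), then offset to 0..255,
    # because 8-bit WAV samples are stored unsigned.
    return bytes((s >> 8) + 128 for s in samples)
```

With this scheme the low 8 bits of every sample are discarded, so two 16-bit values that differ only in those bits (for example 0 and 255) end up as the same 8-bit value.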
