Doing phoneme recognition given different sized audio files?

Question

I am currently working on phoneme recognition with a CNN.

My dataset is labeled, but I am a bit unsure how to ensure that the length of the feature vector corresponds to the length of the audio file.

My input to the CNN is currently a spectrogram of log-mel filter energies, where the y-axis holds the frequency bands and the x-axis holds the frames.

The sentence for the example above is:

fmjc-b-an118 RUBOUT J L Y Z TWO

And phonemes:

RUBOUT: R AH B AW T

J: JH EY

L: EH L

Y: W AY

Z: Z IY

TWO: T UW

In total, 15 phonemes in 249 frames, or nearly 17 frames per phoneme.

But for this example, the spoken word is:

fbbh-b-an90 NO
NO: N OW

In total, 2 phonemes in 97 frames, or about 49 frames per phoneme.

So how can I create an input shape that captures the number of phonemes an audio file contains?
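
The two ratios above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the dictionary layout and the helper name `frames_per_phoneme` are illustrative assumptions, and the frame counts and phoneme sequences are copied from the two examples.

```python
# Utterance ID -> (number of frames, phoneme sequence), copied from the
# two examples in the question. The layout is an illustrative assumption.
utterances = {
    "fmjc-b-an118": (249, "R AH B AW T JH EY EH L W AY Z IY T UW".split()),
    "fbbh-b-an90": (97, "N OW".split()),
}


def frames_per_phoneme(n_frames, phonemes):
    """Average number of frames spanned by one phoneme."""
    return n_frames / len(phonemes)


for utt_id, (n_frames, phonemes) in utterances.items():
    ratio = frames_per_phoneme(n_frames, phonemes)
    print(utt_id, len(phonemes), "phonemes,", round(ratio, 1), "frames each")
```

The ratio differs by roughly a factor of three between the two utterances, which is why a fixed number of frames per phoneme cannot be assumed.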

Edit:

The only way I think it is possible to recreate the input/output relationship is to provide an input shape of one frame. But will the system be able to detect the different phoneme classes in that short a time span, and still say "None" if no phoneme is present?

This would require the output to contain a class for each frame, which requires me to know the duration of each phoneme, which should be possible with this.

But again, is it possible to detect a phoneme given one frame?
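
A minimal sketch of the per-frame output described in the edit, assuming the duration of each phoneme is available as start and end frame indices. The function name `frame_labels` and the segment boundaries for "NO" are hypothetical; real boundaries would have to come from an alignment of the data.

```python
def frame_labels(segments, n_frames, blank="None"):
    """Expand phoneme segments into one label per frame.

    segments: list of (phoneme, start_frame, end_frame), end exclusive.
    Frames not covered by any segment get the `blank` label.
    """
    labels = [blank] * n_frames
    for phoneme, start, end in segments:
        for t in range(max(start, 0), min(end, n_frames)):
            labels[t] = phoneme
    return labels


# Hypothetical alignment for fbbh-b-an90 ("NO", 97 frames).
# The boundaries below are made up for illustration only.
segments = [("N", 20, 45), ("OW", 45, 80)]
labels = frame_labels(segments, 97)
print(labels[0], labels[30], labels[60], labels[90])
```

With targets in this form, the output length always equals the number of frames, whatever the length of the audio file.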

Possible duplicate of [How to train on and make a serialized feature vector for a Neural Network?](http://stackoverflow.com/questions/19419098/how-to-train-on-and-make-a-serialized-feature-vector-for-a-neural-network) - Nikolay Shmyrev, Mar 20 '17 at 03:50
No, it is not possible to reliably detect a phoneme using a single frame. Modern systems use 20-40 concatenated frames for their detectors. Usually you concatenate 30 frames before and 10 frames after. - Nikolay Shmyrev, Mar 21 '17 at 15:25
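
The frame concatenation mentioned in the comment above could look roughly like this. It is a sketch, not a reference implementation: the name `splice_frames`, the edge padding by repeating the first and last frame, and the 40 filter energies per frame are all assumptions; only the 30-before/10-after window comes from the comment.

```python
def splice_frames(frames, left=30, right=10):
    """For each frame t, concatenate frames t-left .. t+right into one vector.

    frames: list of per-frame feature lists, all of the same length.
    Indices outside the utterance are clamped, i.e. the first or last
    frame is repeated as padding (an assumption, other paddings exist).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for k in range(t - left, t + right + 1):
            k = min(max(k, 0), n - 1)  # clamp to a valid frame index
            window.extend(frames[k])
        spliced.append(window)
    return spliced


# Toy input: 97 frames with 40 values each (40 is an assumed feature size).
frames = [[float(t)] * 40 for t in range(97)]
spliced = splice_frames(frames)
print(len(spliced), len(spliced[0]))
```

Each frame still gets its own output label, but the classifier sees 41 frames of context for it, so the number of inputs stays equal to the number of frames in the file.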

score -1 · Answer 1 · answered Mar 19 '17 at 20:02

I have a suggestion. I don't think it's necessarily a good one, but I do think it would work.

If what you are trying to do is train so that phonemes will be recognised regardless of the number of frames they span, you could try time-scaling your training features by a few random coefficients. This is done in a few feature extractors in OpenCV to make image features scale-invariant. I think if you applied it to audio it might make it speed-invariant. I realise this might cause your number of training features to explode, so an alternative approach would be to scale the inputs you are trying to recognize, rather than the training ones. You could perhaps scale all your training features to the same frames-per-phoneme rate, and then scale all inputs to the same rate. It may be that this is entirely impossible; I'm not a machine learning expert. Good luck.
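
A rough sketch of the time-scaling idea, assuming linear interpolation between neighbouring frames is an acceptable way to stretch or compress a feature sequence. The function name `time_scale`, the 0.8 to 1.2 range for the random coefficients, and the toy features are assumptions chosen for illustration.

```python
import random


def time_scale(frames, factor):
    """Resample a sequence of feature frames along the time axis.

    factor > 1 stretches the sequence (slower speech), factor < 1
    compresses it. New frames are linear interpolations of neighbours.
    """
    n = len(frames)
    new_n = max(2, round(n * factor))
    out = []
    for i in range(new_n):
        pos = i * (n - 1) / (new_n - 1)  # position in the original sequence
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


# Toy features: 10 frames with 2 values each.
frames = [[float(t), 2.0 * t] for t in range(10)]

random.seed(0)
# A few random scaling coefficients; the range is an assumed choice.
factors = [random.uniform(0.8, 1.2) for _ in range(3)]
augmented = [time_scale(frames, f) for f in factors]
print([len(a) for a in augmented])
```

If per-frame labels are used, they would need to be rescaled with the same factor so that features and targets stay aligned.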

I am not quite sure I understand your process. Are you suggesting that I should reshape the input so that there is a relation between input and output? - J.Down, Mar 19 '17 at 20:39
