Studying deep neural networks, specifically LSTMs, I decided to follow the idea proposed in this link: Building Speech Dataset for LSTM binary classification, in order to build a classifier.
I have an audio dataset from which I extract MFCC features, where each array is 13x56, one for each phoneme of a word. The training data would look like this:
    X = [ [ [phon1fram[1][1],  phon1fram[1][2],  ..., phon1fram[1][56]],
            [phon1fram[2][1],  phon1fram[2][2],  ..., phon1fram[2][56]],
            ...
            [phon1fram[15][1], phon1fram[15][2], ..., phon1fram[15][56]] ],
          ...
          [ [phon5fram[1][1],  phon5fram[1][2],  ..., phon5fram[1][56]],
            ...
            [phon5fram[15][1], phon5fram[15][2], ..., phon5fram[15][56]] ] ]
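
In code, the input shape I am aiming for would be something like this (a minimal sketch with numpy; the array names and the sizes n_phonemes, n_frames and n_feats are placeholders taken from the example above, not my real values):

    import numpy as np

    # Placeholder sizes, taken from the example above
    n_phonemes = 5    # phonemes in the word
    n_frames   = 15   # frames per phoneme
    n_feats    = 56   # MFCC values per frame

    # Stand-in for the real per-phoneme arrays (each shaped n_frames x n_feats)
    phoneme_frames = [np.random.rand(n_frames, n_feats) for _ in range(n_phonemes)]

    # LSTM input is expected as (samples, timesteps, features)
    X = np.stack(phoneme_frames)
    print(X.shape)  # (5, 15, 56)
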
For the labelling, is it correct that the first frames should be labelled as "intermediate" and only the last frame actually represents the phoneme?
    Y = [ [ [0, 0, ..., 0],     # intermediate
            [0, 0, ..., 0],     # intermediate
            ...
            [1, 0, ..., 0] ],   # this one is a phoneme
          [ [0, 0, ..., 0],     # intermediate
            ...
            [0, 1, ..., 0] ] ]  # another phoneme
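
Building that labelling in code would look roughly like this (a minimal sketch; the number of classes and the class indices in phoneme_ids are assumptions, not my real values):

    import numpy as np

    # Sketch of the labelling above: every frame of a phoneme gets an all-zero
    # "intermediate" vector, and only the last frame gets a one-hot vector for
    # the phoneme itself. Class count and indices are placeholders.
    n_phonemes = 5
    n_frames   = 15
    n_classes  = 10                  # assumed size of the label vector

    phoneme_ids = [0, 1, 2, 3, 4]    # assumed class index of each phoneme

    Y = np.zeros((n_phonemes, n_frames, n_classes))
    for i, cls in enumerate(phoneme_ids):
        Y[i, -1, cls] = 1.0          # only the last frame is labelled
    print(Y.shape)  # (5, 15, 10)
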
Is this really correct? During my first tests, all the predicted outputs tended toward this "intermediate" label, since it is by far the most prevalent class. Is there any other approach I could use?
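
For reference, here is a minimal sketch of the kind of per-frame (many-to-many) setup described above, assuming Keras; the hidden size of 32 and the number of classes are placeholders, and the "intermediate" frames would have to be one of the output classes:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, TimeDistributed, Dense

    # Placeholder sizes matching the sketches above
    n_frames, n_feats, n_classes = 15, 56, 10

    # One softmax prediction per frame, so most targets are "intermediate"
    model = Sequential([
        LSTM(32, input_shape=(n_frames, n_feats), return_sequences=True),
        TimeDistributed(Dense(n_classes, activation='softmax')),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()
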