I need to train a bidirectional LSTM model to recognize discrete speech (individual digits from 0 to 9), and I have recorded speech from 100 speakers. What should I do next? (Assume I am splitting the recordings into individual .wav files, each containing one digit.) I will be using MFCCs as the input features for the network.
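To make my setup concrete: this is the feature extraction I have in mind for each .wav file. In practice I would probably use a library such as librosa or python_speech_features, but here is a minimal NumPy sketch of the MFCC pipeline as I understand it (frame, window, power spectrum, mel filterbank, log, DCT); all parameter values are my own assumptions, not requirements:

```python
import numpy as np

def mfcc(signal, sr, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """Minimal MFCC sketch: frame -> Hann window -> power spectrum
    -> mel filterbank -> log -> DCT-II, keeping the first n_mfcc coefficients."""
    # Slice the signal into overlapping frames and apply a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank
    def hz_to_mel(f):
        return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m):
        return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II to decorrelate the log filterbank energies
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return log_energy @ dct.T  # shape: (n_frames, n_mfcc)

# Stand-in for one recorded digit: a 1-second synthetic tone at 16 kHz
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
feats = mfcc(tone, sr)
print(feats.shape)  # (97, 13): one 13-dim feature vector per frame
```

So each utterance becomes a variable-length sequence of fixed-size feature vectors, which is what the BLSTM would consume.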
Further, I would like to know how the dataset (or its labeling) would need to differ if I use a library that supports CTC (Connectionist Temporal Classification).
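To make the second question concrete, here is my current understanding of how one training example would be structured in each case (the field names are hypothetical, just for illustration); I would like this confirmed or corrected:

```python
# Without CTC: each .wav file is one example with a single class label (0-9),
# typically trained with a cross-entropy loss over 10 classes.
example_plain = {
    "mfcc_shape": (97, 13),  # variable n_frames, fixed feature dim
    "label": 7,
}

# With CTC: the target is a label *sequence*, and the frame count and label
# length are passed to the loss so it can learn the alignment automatically.
# For isolated digits the sequence has length 1, but no per-frame alignment
# is needed, and the same setup would extend to connected digit strings.
example_ctc = {
    "mfcc_shape": (97, 13),
    "labels": [7],        # e.g. [7, 3, 1] for connected digits
    "input_length": 97,   # number of MFCC frames
    "label_length": 1,    # number of symbols in the target sequence
}
print(example_ctc["labels"])
```

An extra "blank" class is also reserved by the CTC loss itself (so the network would output 11 classes for 10 digits), but that affects the model's output layer rather than the stored dataset.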