I'm attempting to convert raw audio data from one format to another for the purposes of voice recognition.
- The audio is received from a Discord server in
20ms
chunks in the format:48Khz, 16-bit stereo signed BigEndian PCM
. - I'm using CMU's Sphinx for voice recognition, which takes audio as an
InputStream
inRIFF (little-endian) WAVE audio, 16-bit, mono 16,000Hz
Audio data is received in a byte[]
with length 3840
. This byte[]
array contains 20ms
of audio in format 1 described above. That means that 1 second of this audio is 3840 * 50
, which is 192,000
. So that's 192,000
samples per second. This makes sense, 48KHz
sample rate, times 2 (96K samples) because a byte is 8 bits, and our audio is 16 bit, and times an additional two for stereo. So 48,000 * 2 * 2 = 192,000
.
So I first call this method every time an audio packet is received:
private void addToPacket(byte[] toAdd) {
if(packet.length >= 576000 && !done) {
System.out.println("Processing needs to occur...");
getResult(convertAudio());
packet = null; // reset the packet
return;
}
byte[] newPacket = new byte[packet.length + 3840];
// copy old packet onto new temp array
System.arraycopy(packet, 0, newPacket, 0, packet.length);
// copy toAdd packet onto new temp array
System.arraycopy(toAdd, 0, newPacket, 3840, toAdd.length);
// overwrite the old packet with the newly resized packet
packet = newPacket;
}
This will just add new packets onto one big byte[] until the byte[] contains 3 seconds of audio data (576,000 samples, or 192000 * 3). 3 seconds of audio data is enough time (just a guess) to detect if the user said the bot's activation hot word like "hey computer.". Here's how I convert the sound data:
private byte[] convertAudio() {
// STEP 1 - DROP EVERY OTHER PACKET TO REMOVE STEREO FROM THE AUDIO
byte[] mono = new byte[96000];
for(int i = 0, j = 0; i % 2 == 0 && i < packet.length; i++, j++) {
mono[j] = packet[i];
}
// STEP 2 - DROP EVERY 3RD PACKET TO CONVERT TO 16K HZ Audio
byte[] resampled = new byte[32000];
for(int i = 0, j = 0; i % 3 == 0 && i < mono.length; i++, j++) {
resampled[j] = mono[i];
}
// STEP 3 - CONVERT TO LITTLE ENDIAN
ByteBuffer buffer = ByteBuffer.allocate(resampled.length);
buffer.order(ByteOrder.BIG_ENDIAN);
for(byte b : resampled) {
buffer.put(b);
}
buffer.order(ByteOrder.LITTLE_ENDIAN);
buffer.rewind();
for(int i = 0; i < resampled.length; i++) {
resampled[i] = buffer.get(i);
}
return resampled;
}
And finally, attempt to recognize the speech:
private void getResult(byte[] toProcess) {
InputStream stream = new ByteArrayInputStream(toProcess);
recognizer.startRecognition(stream);
SpeechResult result;
while ((result = recognizer.getResult()) != null) {
System.out.format("Hypothesis: %s\n", result.getHypothesis());
}
recognizer.stopRecognition();
}
The problem I'm having is that CMUSphinx
doesn't crash or provide any error messages, it just comes up with an empty hypothesis every 3 seconds. I'm not exactly sure why, but my guess is that I didn't convert the sound correctly. Any ideas? Any help would be greatly appreciated.