Converting Raw PCM Data to RIFF WAV

Question

I'm attempting to convert raw audio data from one format to another for the purposes of voice recognition.

The audio is received from a Discord server in 20ms chunks in the format: 48Khz, 16-bit stereo signed BigEndian PCM.
I'm using CMU's Sphinx for voice recognition, which takes audio as an InputStream in RIFF (little-endian) WAVE audio, 16-bit, mono 16,000Hz

Audio data is received in a byte[] with length 3840. This byte[] array contains 20ms of audio in format 1 described above. That means that 1 second of this audio is 3840 * 50, which is 192,000. So that's 192,000 samples per second. This makes sense, 48KHz sample rate, times 2 (96K samples) because a byte is 8 bits, and our audio is 16 bit, and times an additional two for stereo. So 48,000 * 2 * 2 = 192,000.

So I first call this method every time an audio packet is received:

private void addToPacket(byte[] toAdd) {
    if(packet.length >= 576000 && !done) {
        System.out.println("Processing needs to occur...");
        getResult(convertAudio());
        packet = null; // reset the packet
        return;
    }

    byte[] newPacket = new byte[packet.length + 3840];
    // copy old packet onto new temp array
    System.arraycopy(packet, 0, newPacket, 0, packet.length);
    // copy toAdd packet onto new temp array
    System.arraycopy(toAdd, 0, newPacket, 3840, toAdd.length);
    // overwrite the old packet with the newly resized packet
    packet = newPacket;
}

This will just add new packets onto one big byte[] until the byte[] contains 3 seconds of audio data (576,000 samples, or 192000 * 3). 3 seconds of audio data is enough time (just a guess) to detect if the user said the bot's activation hot word like "hey computer.". Here's how I convert the sound data:

    private byte[] convertAudio() {
        // STEP 1 - DROP EVERY OTHER PACKET TO REMOVE STEREO FROM THE AUDIO
        byte[] mono = new byte[96000];
        for(int i = 0, j = 0; i % 2 == 0 && i < packet.length; i++, j++) {
            mono[j] = packet[i];
        }

        // STEP 2 - DROP EVERY 3RD PACKET TO CONVERT TO 16K HZ Audio
        byte[] resampled = new byte[32000];
        for(int i = 0, j = 0; i % 3 == 0 && i < mono.length; i++, j++) {
            resampled[j] = mono[i];
        }

        // STEP 3 - CONVERT TO LITTLE ENDIAN
        ByteBuffer buffer = ByteBuffer.allocate(resampled.length);
        buffer.order(ByteOrder.BIG_ENDIAN);
        for(byte b : resampled) {
            buffer.put(b);
        }
        buffer.order(ByteOrder.LITTLE_ENDIAN);
        buffer.rewind();
        for(int i = 0; i < resampled.length; i++) {
            resampled[i] = buffer.get(i);
        }

        return resampled;
    }

And finally, attempt to recognize the speech:

private void getResult(byte[] toProcess) {
    InputStream stream = new ByteArrayInputStream(toProcess);
    recognizer.startRecognition(stream);
    SpeechResult result;
    while ((result = recognizer.getResult()) != null) {
        System.out.format("Hypothesis: %s\n", result.getHypothesis());
    }
    recognizer.stopRecognition();
}

The problem I'm having is that CMUSphinx doesn't crash or provide any error messages, it just comes up with an empty hypothesis every 3 seconds. I'm not exactly sure why, but my guess is that I didn't convert the sound correctly. Any ideas? Any help would be greatly appreciated.

Look at your stereo dropping loop conditions. Run it in a debugger. See what happens. If you want every other value why increase I by one? Also don’t create a new buffer all the time and copy data around, that’s a huge stress on the GC and completely useless. One buffer of the size you want in the end, copy data into it, done. — Sami Kuhmonen, Dec 19 '17 at 04:18

score 0 · Accepted Answer · answered Dec 24 '17 at 05:34

So, there's actual a much better, in-house solution for converting audio from a byte[].

Here's what I found works pretty well:

        // Specify the output format you want
        AudioFormat target = new AudioFormat(16000f, 16, 1, true, false);
        // Get the audio stream ready, and pass in the raw byte[]
        AudioInputStream is = AudioSystem.getAudioInputStream(target, new AudioInputStream(new ByteArrayInputStream(raw), AudioReceiveHandler.OUTPUT_FORMAT, raw.length));
        // Write a temporary file to the computer somewhere, this method will return a InputStream that can be used for recognition
        try {
            AudioSystem.write(is, AudioFileFormat.Type.WAVE, new File("C:\\filename.wav"));
        } catch(Exception e) {}

Converting Raw PCM Data to RIFF WAV

1 Answers1

Linked