
I'm trying to feed audio from an online communication app into the Vosk speech recognition API.

The audio comes in the form of a byte array with this audio format: PCM_SIGNED 48000.0 Hz, 16 bit, stereo, 4 bytes/frame, big-endian. In order to process it with Vosk, it needs to be mono and little-endian.

This is my current attempt:

        byte[] audioData = userAudio.getAudioData(1);
        short[] convertedAudio = new short[audioData.length / 2];
        ByteBuffer buffer = ByteBuffer.allocate(convertedAudio.length * Short.BYTES);
        
        // Convert to mono, I don't think I did it right though
        int j = 0;
        for (int i = 0; i < audioData.length; i += 2)
            convertedAudio[j++] = (short) (audioData[i] << 8 | audioData[i + 1] & 0xFF);

        // Convert to little endian
        buffer.order(ByteOrder.BIG_ENDIAN);
        for (short s : convertedAudio)
            buffer.putShort(s);
        buffer.order(ByteOrder.LITTLE_ENDIAN);
        buffer.rewind();

        for (int i = 0; i < convertedAudio.length; i++)
            convertedAudio[i] = buffer.getShort();

        queue.add(convertedAudio);
moeux
  • I would do a bit of checking, because this "App" that works so fine may be sending me a bunch of garbage - then I would check the formats, the received and the desired - there is a thing in AudioSystem - are they as you say? I'm not sure about that! - then I would proceed to manual labor!! – gpasch Sep 29 '21 at 04:02
  • @gpasch Could you elaborate please? What "thing" in `AudioSystem`? To check the formats? What manual labor? – moeux Sep 29 '21 at 14:11

2 Answers


I had this same problem and found a Stack Overflow post that converts the raw PCM byte array into an `AudioInputStream`.

I assume you're using the Java Discord API (JDA), so here's the initial code I have for the `handleUserAudio()` function; it uses Vosk together with the approach from the post linked above:

    // Define the audio format that Vosk uses: 16 kHz, 16-bit, mono, signed, little endian
    AudioFormat target = new AudioFormat(16000, 16, 1, true, false);

    try {
        byte[] data = userAudio.getAudioData(1.0f);
        // Create an audio stream that converts the byte array from Discord to the target format
        AudioInputStream inputStream = AudioSystem.getAudioInputStream(target,
                new AudioInputStream(
                        new ByteArrayInputStream(data), AudioReceiveHandler.OUTPUT_FORMAT, data.length));

        // This is what was used before
        // InputStream inputStream = new ByteArrayInputStream(data);

        int nbytes;
        byte[] b = new byte[4096];
        while ((nbytes = inputStream.read(b)) >= 0) {
            if (recognizer.acceptWaveForm(b, nbytes)) {
                System.out.println(recognizer.getResult());
            } else {
                System.out.println(recognizer.getPartialResult());
            }
        }
        // queue.add(data);
    } catch (Exception e) {
        e.printStackTrace();
    }

This works so far; however, it routes everything through the recognizer's `getPartialResult()` method. But at least Vosk is understanding the audio coming from the Discord bot.

    Thank you for your great answer. I had originally abandoned this project. I just want to add that I achieved better results by not using the `if` statement and printing to the console, but rather getting the result at the very end. No idea why, but I got more accurate results, whereas the other way I sometimes got empty strings. Also, I've added a `SpeechRecognizer#reset()` call between readings, which improved the results a little. `SpeechRecognizer#getPartialResult()` returned empty strings for me though, so I've gone with `SpeechRecognizer#getResult()` instead. – moeux Jan 11 '22 at 01:11

Signed PCM is certainly supported. The problem is that 48000 fps is not. I think the highest frame rate supported by Java directly is 44100.

As to what course of action to take, I'm not sure what to recommend. Maybe there are libraries that can be employed? It is certainly possible to do the conversions manually with the byte data directly, where you enforce the expected data formats.

I can write a bit more about the conversion process itself (assembling bytes into PCM, manipulating the PCM, creating bytes from PCM), if requested. Is Vosk expecting 48000 fps as well?


Going from stereo to mono is literally a matter of taking the sum of the left and right PCM values. It is common to add a step to ensure the range is not exceeded (-1 to 1 if the PCM is coded as normalized floats, -32768 to 32767 if it is coded as shorts).
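As a minimal sketch of that summing step (the class and method names are mine, assuming the PCM is already held in shorts):

```java
public class MonoMixer {
    // Mix one stereo frame down to mono by summing left and right,
    // clamping to the signed 16-bit range to avoid wrap-around distortion.
    public static short mix(short left, short right) {
        int sum = left + right;
        return (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, sum));
    }
}
```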

The following code fragment is an example of taking a single PCM value (a signed float, normalized to the range -1 to 1) and generating two bytes (16 bits) in little-endian order. The array buffer is of type float and holds the PCM values; the array audioBytes is of type byte.

buffer[i] *= 32767;

audioBytes[i*2] = (byte) buffer[i];
audioBytes[i*2 + 1] = (byte)((int)buffer[i] >> 8);

To make it big endian, just swap the indexes of audioBytes, or the operations (byte) buffer[i] and (byte)((int)buffer[i] >> 8 ). This code is from the class AudioCue, a class that I wrote that functions as an enhanced Clip. See lines 1391-1394.
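That fragment can be wrapped up as a small standalone helper (my naming, not from AudioCue), taking a sample that is already scaled to the short range:

```java
public class SampleWriter {
    // Split a signed 16-bit sample into two bytes, low byte first (little endian).
    public static byte[] toLittleEndian(short sample) {
        return new byte[] { (byte) sample, (byte) (sample >> 8) };
    }
}
```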

I think you can extrapolate the reverse process (converting incoming bytes to PCM), but here is an example, from lines 391-393 of the same code. In this case temp is a float array that will hold the PCM values calculated from the byte stream. In my code, the value is soon afterwards divided by 32767f to normalize it (line 400).

temp[clipIdx++] = ( buffer[bufferIdx++] & 0xff ) | ( buffer[bufferIdx++] << 8 ) ;

For big endian, you would reverse the order of & 0xff and << 8.
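Spelled out as a standalone helper (again my own naming), the big-endian assembly would look something like this:

```java
public class SampleReader {
    // Assemble two bytes into a signed 16-bit sample, high byte first (big endian).
    public static short fromBigEndian(byte hi, byte lo) {
        return (short) ((hi << 8) | (lo & 0xff));
    }
}
```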

How you iterate through the structures is up to your personal preference; I don't know that I've picked the optimal methods here. For your situation, I'd be tempted to hold the PCM value in a short (ranging from -32768 to 32767) instead of normalizing to -1 to 1 floats. Normalizing makes more sense if you are processing audio data from multiple sources, but the only processing you are going to do is add the left and right PCM together to get your mono value. It's good, by the way, after summing left and right, to ensure the numerical range isn't exceeded, as that can create some pretty harsh distortion.
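Putting those pieces together for the asker's exact case, a sketch of the whole conversion could look like the following (my own code, using ByteBuffer to handle both byte orders; note it only fixes the channel count and endianness, not the 48000 vs. 16000 sample-rate difference):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class StereoConverter {
    // Convert 16-bit big-endian stereo PCM into 16-bit little-endian mono.
    // Each 4-byte input frame (left sample + right sample) becomes one
    // 2-byte output sample.
    public static byte[] toMonoLittleEndian(byte[] stereoBigEndian) {
        ByteBuffer in = ByteBuffer.wrap(stereoBigEndian).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer out = ByteBuffer.allocate(stereoBigEndian.length / 2)
                .order(ByteOrder.LITTLE_ENDIAN);
        while (in.remaining() >= 4) {
            int sum = in.getShort() + in.getShort(); // left + right
            // Clamp so the sum can't wrap around the short range.
            sum = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, sum));
            out.putShort((short) sum);
        }
        return out.array();
    }
}
```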

Phil Freihofner
  • I've changed the sample rate to 44100 Hz in my variable `audioFormat`, I'm still getting the exception. You can set the sample rate in the vosk constructor, so that shouldn't be a problem, just a number. I'd appreciate it if you would write about the conversion process. My goal is to manipulate the byte array so it becomes mono and little endian. Since vosk's `acceptWaveform()` method just expects an `InputStream`, I can convert the manipulated byte array into a `ByteArrayInputStream` and that should hopefully work. – moeux Sep 28 '21 at 23:06
  • @PhilFreihofner thank you for updating your answer. I didn't understand it completely though. Can you explain your first code snippet to me? You said it converts it to 16-bit, little endian but what does your second snippet do then? Since I thought the second snippet was for turning it to little endian. Does your first snippet convert it into mono? What is the `float` array *buffer* populated with? Is the `byte` array *audioBytes* in my case `queue.poll()`, so the `byte` array I want to convert? Is the `float` array *temp* a copy of my audio `byte` array mentioned in above question? – moeux Sep 29 '21 at 14:22
  • Thanks for the detailed questions. I've edited in an attempt to disambiguate what I wrote previously. Is the answer clearer now? I was taking advantage of code I've previously written, and the variable names don't exactly line up with what you have. In the two snippets, the byte[] array is for the byte stream (what you receive from your source, what you send to Vosk), and the float[] array is for the PCM. I may have been too terse about how instead of using float[], it might make more sense to use a short[] array. This eliminates the normalization & denormalization steps. – Phil Freihofner Sep 29 '21 at 16:22
  • Oops comment was updated before the answer was saved. If you looked just now and didn't see the changes, please check again. Will be happy to further clarify. Always good to have other eyes look and make suggestions or corrections! – Phil Freihofner Sep 29 '21 at 16:33