Signed PCM is certainly supported. The problem is that 48000 fps is not. I think the highest frame rate supported by Java directly is 44100.
As to what course of action to take, I'm not sure what to recommend. Maybe there are libraries that can be employed? It is certainly possible to do the conversions manually with the byte data directly, where you enforce the expected data formats.
I can write a bit more about the conversion process itself (assembling bytes into PCM, manipulating the PCM, creating bytes from PCM), if requested. Is VOSK also expecting 48000 fps?
Going from stereo to mono is a matter of literally taking the sum of the left and right PCM values. It is common to add a step to ensure the range is not exceeded. (For 16-bit audio, the range is -1 to 1 if the PCM is coded as normalized floats, or -32768 to 32767 if it is coded as shorts.)
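To make the summing and range check concrete, here is a minimal sketch. The class name MonoMix and the method name mixToMono are hypothetical, made up for this illustration. It assumes you simply clamp to the legal range; averaging left and right is another common choice.

```java
final class MonoMix {
    // Sum two 16-bit samples and clamp the result to the short range.
    static short mixToMono(short left, short right) {
        int sum = left + right; // int arithmetic, so the sum cannot wrap around
        if (sum > Short.MAX_VALUE) return Short.MAX_VALUE;
        if (sum < Short.MIN_VALUE) return Short.MIN_VALUE;
        return (short) sum;
    }

    // Same idea for normalized floats: clamp the sum to -1..1.
    static float mixToMono(float left, float right) {
        float sum = left + right;
        return Math.max(-1f, Math.min(1f, sum));
    }
}
```

Without the clamp, a sum such as 30000 + 30000 would wrap to a negative short, which is where the harsh distortion comes from.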
The following code fragment is an example of taking a single PCM value (signed float, normalized to the range -1 to 1) and generating two bytes (16 bits) in little endian order. The array buffer is a float array and holds the PCM values. The array audioBytes is a byte array.
buffer[i] *= 32767;
audioBytes[i*2] = (byte) buffer[i];
audioBytes[i*2 + 1] = (byte)((int)buffer[i] >> 8 );
To make it big endian, just swap the indexes of audioBytes, or the operations (byte) buffer[i] and (byte)((int)buffer[i] >> 8 ). This code is from the class AudioCue, a class that I wrote that functions as an enhanced Clip. See lines 1391-1394.
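For a whole array, the same idea might look like the sketch below. This is not the AudioCue code. The names PcmEncoder and floatsToBytes and the bigEndian flag are hypothetical. It also assumes you want the input clamped to -1..1 first, and it leaves the float array unmodified rather than scaling it in place.

```java
final class PcmEncoder {
    // Convert normalized floats (-1..1) to 16-bit signed PCM bytes.
    static byte[] floatsToBytes(float[] pcm, boolean bigEndian) {
        byte[] out = new byte[pcm.length * 2];
        for (int i = 0; i < pcm.length; i++) {
            // Assumption: clamp out-of-range input instead of letting it wrap.
            float clamped = Math.max(-1f, Math.min(1f, pcm[i]));
            int sample = (int) (clamped * 32767);
            byte low = (byte) sample;         // least significant 8 bits
            byte high = (byte) (sample >> 8); // most significant 8 bits
            // Little endian stores the low byte first, big endian the high byte.
            out[i * 2] = bigEndian ? high : low;
            out[i * 2 + 1] = bigEndian ? low : high;
        }
        return out;
    }
}
```

The flag only decides which of the two bytes is written first, which is the index swap described above.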
I think you can extrapolate the reverse process (converting incoming bytes to PCM). But here is an example of doing this, from lines 391-393 of the same code. In this case temp is a float array that will hold the PCM values that are calculated from the byte stream. In my code, the value will soon be divided by 32767f to make it normalized (line 400).
temp[clipIdx++] = ( buffer[bufferIdx++] & 0xff ) | ( buffer[bufferIdx++] << 8 ) ;
For big endian, you would reverse the order of & 0xff and << 8, so that the first byte read is shifted left and the second byte is masked.
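Here is a hedged sketch of the decoding direction that handles both byte orders. The names PcmDecoder and bytesToShorts and the bigEndian flag are hypothetical, and it stores the result in a short array rather than the float array used in my class.

```java
final class PcmDecoder {
    // Assemble pairs of bytes into signed 16-bit PCM samples.
    static short[] bytesToShorts(byte[] audioBytes, boolean bigEndian) {
        short[] pcm = new short[audioBytes.length / 2];
        for (int i = 0; i < pcm.length; i++) {
            byte first = audioBytes[i * 2];
            byte second = audioBytes[i * 2 + 1];
            byte low = bigEndian ? second : first;
            byte high = bigEndian ? first : second;
            // Mask the low byte so its sign bit does not smear into the
            // upper bits. The high byte keeps its sign when shifted.
            pcm[i] = (short) ((low & 0xff) | (high << 8));
        }
        return pcm;
    }
}
```

Forgetting the & 0xff mask is the usual bug here: any low byte of 0x80 or above is sign-extended and corrupts the high byte.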
How you iterate through the structures is up to your personal preference. I don't know that I've picked the optimal methods here. For your situation, I'd be tempted to hold the PCM value in a short (ranging from -32768 to 32767) instead of normalizing to floats between -1 and 1. Normalizing makes more sense if you are processing audio data from multiple sources, but the only processing you are going to do is add the left and right PCM together to get your mono value. It's good, by the way, after summing left and right, to ensure the numerical range isn't exceeded, as exceeding it can create some pretty harsh distortion.
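Putting the pieces together for your case, a sketch could look like the following. The names StereoToMonoBytes and stereoToMono are hypothetical. It assumes 16-bit signed, little endian, interleaved stereo input (left sample first in each 4-byte frame), so adjust it if your format differs.

```java
final class StereoToMonoBytes {
    // Convert interleaved 16-bit little endian stereo bytes to mono bytes.
    static byte[] stereoToMono(byte[] stereo) {
        int frames = stereo.length / 4; // 2 bytes left + 2 bytes right per frame
        byte[] mono = new byte[frames * 2];
        for (int f = 0; f < frames; f++) {
            int in = f * 4;
            // Assemble the two samples of this frame (low byte first).
            int left = (stereo[in] & 0xff) | (stereo[in + 1] << 8);
            int right = (stereo[in + 2] & 0xff) | (stereo[in + 3] << 8);
            // Sum, then clamp to the 16-bit range to avoid wrap-around.
            int sum = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, left + right));
            // Write the mono sample back out as two little endian bytes.
            mono[f * 2] = (byte) sum;
            mono[f * 2 + 1] = (byte) (sum >> 8);
        }
        return mono;
    }
}
```

The output has half as many bytes as the input and the same frame rate. Any frame rate conversion would be a separate step.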