
I know that there are a lot of resources online explaining how to deinterleave PCM data. In the course of my current project I have looked at most of them...but I have no background in audio processing and I have had a very hard time finding a detailed explanation of how exactly this common form of audio is stored.

I do understand that my audio will have two channels and thus the samples will be stored in the format [left][right][left][right]... What I don't understand is what exactly this means. I have also read that each sample is stored in the format [left MSB][left LSB][right MSB][right LSB]. Does this mean that each 16-bit integer actually encodes two 8-bit frames, or is each 16-bit integer its own frame destined for either the left or right channel?

Thank you everyone. Any help is appreciated.

Edit: If you choose to give examples please refer to the following.

Method Context

Specifically, what I have to do is convert an interleaved short[] into two float[]s, one for the left channel and one for the right. I will be implementing this in Java.

public static float[][] deinterleaveAudioData(short[] interleavedData) {
    //initialize the channel arrays
    float[] left = new float[interleavedData.length / 2];
    float[] right = new float[interleavedData.length / 2];
    //iterate through the buffer
    for (int i = 0; i < interleavedData.length; i++) {
        //THIS IS WHERE I DON'T KNOW WHAT TO DO
    }
    //return the separated left and right channels
    return new float[][]{left, right};
}

My Current Implementation

I have tried playing the audio that results from this. It's very close, close enough that you could understand the words of a song, but is still clearly not the correct method.

public static float[][] deinterleaveAudioData(short[] interleavedData) {
    //initialize the channel arrays
    float[] left = new float[interleavedData.length / 2];
    float[] right = new float[interleavedData.length / 2];
    //iterate through the buffer
    for (int i = 0; i < left.length; i++) {
        left[i] = (float) interleavedData[2 * i];
        right[i] = (float) interleavedData[2 * i + 1];
    }
    //return the separated left and right channels
    return new float[][]{left, right};
}

Format

If anyone would like more information about the format of the audio the following is everything I have.

  • Format is PCM 2 channel interleaved big endian linear int16
  • Sample rate is 44100 Hz
  • Number of shorts per short[] buffer is 2048
  • Number of frames per short[] buffer is 1024
  • Frames per packet is 1
William Rosenbloom
  • Your implementation looks like it should be almost exactly correct - which is confirmed when you say you can understand words, even if they sound wrong. What are the details of the output format you're using? My guess would be that the short-to-float conversion needs to be scaled and/or offset - it'd be kind of weird to use float to specify the range [-32768, 32767]. – Sbodd Aug 20 '15 at 22:02
  • How did you obtain this `short[]` array? Endianness should not matter if the samples are already in two byte ints. Is the source signed or unsigned? In what range is the output expected to be? – Piotr Praszmo Aug 20 '15 at 22:02
  • @Sbodd Yes reading the answers I think scaling might be the problem. I'm working on implementing a normalized process now. – William Rosenbloom Aug 21 '15 at 02:46
  • @Banthar This short array comes from the [Spotify Android SDK](https://developer.spotify.com/technologies/spotify-android-sdk/android-sdk-api-reference/). This is why I only have access to these little chunks - because I only have authority to stream. The shorts are signed and their expected range encompasses (based on what I've seen in my debugger) almost the entire -32768 to 32767 range of shorts. – William Rosenbloom Aug 21 '15 at 02:50

4 Answers


I do understand that my audio will have two channels and thus the samples will be stored in the format [left][right][left][right]... What I don't understand is what exactly this means.

Interleaved PCM data is stored with one sample per channel, in channel order, before moving on to the samples for the next point in time. A PCM frame is the group of samples for all channels at a single point in time. If you have stereo audio with left and right channels, then one sample from each together makes a frame.

  • Frame 0: [left sample][right sample]
  • Frame 1: [left sample][right sample]
  • Frame 2: [left sample][right sample]
  • Frame 3: [left sample][right sample]
  • etc...
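
In terms of array indices, a minimal sketch of that layout using the interleavedData array from the question (the helper names are just for illustration):

static short leftSampleOfFrame(short[] interleavedData, int n) {
    return interleavedData[2 * n];      //the left sample comes first in frame n
}

static short rightSampleOfFrame(short[] interleavedData, int n) {
    return interleavedData[2 * n + 1];  //the right sample immediately follows it
}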

Each sample is a measurement and digital quantization of pressure at an instantaneous point in time. That is, if you have 8 bits per sample, you have 256 possible levels of precision that the pressure can be sampled at. Knowing that sound waves are... waves... with peaks and valleys, we are going to want to be able to measure distance from the center. So, we can define center at 127 or so and subtract and add from there (0 to 255, unsigned) or we can treat those 8 bits as signed (same values, just different interpretation of them) and go from -128 to 127.
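
As a tiny illustration of those two interpretations (the values here are arbitrary):

int unsignedSample = 200;                   //stored as 0..255, center around 128
int signedSample   = (byte) unsignedSample; //same bits reinterpreted as signed: -56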

Using 8 bits per sample with single channel (mono) audio, we use one byte per sample meaning one second of audio sampled at 44.1kHz uses exactly 44,100 bytes of storage.

Now, let's assume 8 bits per sample, but in stereo at 44.1 kHz. Every other byte is for the left channel, and the remaining bytes are for the right.

LRLRLRLRLRLRLRLRLRLRLR...

Scale it up to 16 bits and you have two bytes per sample (samples are shown with brackets [ and ]; spaces indicate frame boundaries):

[LL][RR] [LL][RR] [LL][RR] [LL][RR] [LL][RR] [LL][RR]...
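
If you ever receive the raw bytes instead of shorts, a minimal sketch of assembling one big-endian 16-bit sample (the helper name is hypothetical):

static short bigEndianSample(byte msb, byte lsb) {
    //combine the most significant and least significant bytes into one signed 16-bit value
    return (short) (((msb & 0xFF) << 8) | (lsb & 0xFF));
}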

I have also read that each sample is stored in the format [left MSB][left LSB][right MSB][right LSB].

Not necessarily. The audio can be stored in any endianness. Little endian is the most common, but that isn't a magic rule. Channels do, however, always go in order, and front left is channel 0 in most cases.

Does this mean that each 16-bit integer actually encodes two 8-bit frames, or is each 16-bit integer its own frame destined for either the left or right channel?

Each value (16-bit integer in this case) is destined for a single channel. Never would you have two multi-byte values smashed into each other.

I hope that's helpful. I can't run your code, but given your description, I suspect you have an endian problem and that your samples aren't actually big endian.

Brad

Let's start by getting some terminology out of the way

  • A channel is a monaural stream of samples. The term does not necessarily imply that the samples are contiguous in the data stream.
  • A frame is a set of co-incident samples. For stereo audio (e.g. L & R channels) a frame contains two samples.
  • A packet is 1 or more frames, and is typically the minimum number of frames that can be processed by a system at once. For PCM audio, a packet often contains 1 frame, but for compressed audio it will be larger.
  • Interleaving is a term typically used for stereo audio, in which the data stream consists of consecutive frames of audio. The stream therefore looks like L1R1L2R2L3R3......LnRn

Both big and little endian audio formats exist, and which one is used depends on the use-case. However, it's generally only an issue when exchanging data between systems - you'll always use the native byte order when processing or interfacing with operating system audio components.

You don't say whether you're using a little or big endian system, but I suspect it's probably the former. In which case you need to byte-reverse the samples.

Although not set in stone, floating-point samples are usually in the range -1.0 < x < +1.0, so you want to divide the samples by 1<<15. When 16-bit linear types are used, they are typically signed.

Taking care of byte-swapping and format conversions:

int s = (int) interleavedData[2 * i];                          //left-channel sample for frame i
short revS = (short) (((s & 0xff) << 8) | ((s >> 8) & 0xff));  //swap the two bytes
left[i] = ((float) revS) / 32767.0f;                           //scale to roughly [-1.0, 1.0]
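
For context, a sketch of how that per-sample conversion might sit inside the method from the question; Short.reverseBytes is a built-in equivalent of the manual swap above, and whether you need the swap at all depends on whether the shorts really arrive in the wrong byte order:

public static float[][] deinterleaveAudioData(short[] interleavedData) {
    float[] left = new float[interleavedData.length / 2];
    float[] right = new float[interleavedData.length / 2];
    for (int i = 0; i < left.length; i++) {
        //swap bytes only if the incoming data is not already in native byte order
        short l = Short.reverseBytes(interleavedData[2 * i]);
        short r = Short.reverseBytes(interleavedData[2 * i + 1]);
        //scale into the conventional float range (1 << 15 = 32768; 32767 is the other common choice)
        left[i] = l / 32768.0f;
        right[i] = r / 32768.0f;
    }
    return new float[][]{left, right};
}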
marko
  • Interesting that you normalize by `32767.0f`. @maxime.bochon suggests I should divide by 32768. I feel like I have also heard that for multichannel audio buffers volume should further be divided by the number of channels. What would audio sound like if it were not normalized? – William Rosenbloom Aug 21 '15 at 03:11
  • That rather depends on whether a value of 1.0f is considered to be clipped or not. Normalising with `1<<15` is certainly cheaper to compute by a wide margin (the division becomes a bit-shift). As for lack of normalisation: it makes no difference to the signal chain until you hit audio hardware such as a DAC. At that point your signal will be grossly clipped in both directions. – marko Aug 21 '15 at 06:55

Actually you are dealing with an almost typical WAVE file at Audio CD quality, that is to say:

  • 2 channels
  • sampling rate of 44100 Hz
  • each amplitude sample quantized as a 16-bit signed integer

I said almost because big-endianness is usually used in AIFF files (Mac world), not in WAVE files (PC world). And I don't know without searching how to deal with endianness in Java, so I will leave this part to you.

How the samples are stored is quite simple:

  • each sample takes 16 bits (an integer from -32768 to +32767)
  • if channels are interleaved: (L,1),(R,1),(L,2),(R,2),...,(L,n),(R,n)
  • if channels are not: (L,1),(L,2),...,(L,n),(R,1),(R,2),...,(R,n)

Then, to feed an audio callback, it is usually required to provide 32-bit floating-point samples ranging from -1 to +1. And maybe this is where something is missing in your algorithm. Dividing your integers by 32768 (2^(16-1)) should make it sound as expected.
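
As a sketch, applying that scaling to the loop from the question (assuming the byte order already matches and no swapping is needed):

for (int i = 0; i < left.length; i++) {
    //divide each signed 16-bit sample by 32768 so the result lands in roughly [-1.0, +1.0)
    left[i] = interleavedData[2 * i] / 32768.0f;
    right[i] = interleavedData[2 * i + 1] / 32768.0f;
}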

maxime.bochon
  • Honestly given this information I think I might have little endian data, which could be part of my problem. It's a long story but I thought I had big endian data because I tested audio from the same sender ***on an iPhone*** with Apple's [AudioConverter Service](https://developer.apple.com/library/ios/documentation/MusicAudio/Reference/AudioConverterServicesReference/). I do need big endian data for my destination. I also believe normalizing the data will help and am working on implementing that now. – William Rosenbloom Aug 21 '15 at 03:20

I ran into a similar issue with de-interleaving the short[] frames that came in through Spotify Android SDK's onAudioDataDelivered().

The documentation for onAudioDelivered was poorly written a year ago. See the GitHub issue. They've updated the docs with a better description and more accurate parameter names:

onAudioDataDelivered(short[] samples, int sampleCount, int sampleRate, int channels)

What can be confusing is that samples.length can be 4096. However, it contains only sampleCount valid samples. If you're receiving stereo audio and sampleCount = 2048, there are only 1024 frames (each frame has two samples) of audio in the samples array!

So you'll need to update your implementation to make sure you're working with sampleCount and not samples.length.
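
A minimal sketch of what that might look like; the sampleCount parameter is taken from the callback signature above rather than from the original question:

public static float[][] deinterleaveAudioData(short[] samples, int sampleCount) {
    int frames = sampleCount / 2;  //stereo: two valid samples (left and right) per frame
    float[] left = new float[frames];
    float[] right = new float[frames];
    for (int i = 0; i < frames; i++) {
        left[i] = samples[2 * i] / 32768.0f;
        right[i] = samples[2 * i + 1] / 32768.0f;
    }
    return new float[][]{left, right};
}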

user740857