Is it correct to assume that floating-point samples in a WAV or AIFF file will be normalized?

Question

Say I have a program that reads a .WAV or .AIFF file, and the file's audio is encoded as floating-point sample-values. Is it correct for my program to assume that any well-formed (floating-point-based) .WAV or .AIFF file will contain sample values only in the range [-1.0f,+1.0f]? I couldn't find anything in the WAV or AIFF specifications that addresses this point.

And if that is not a valid assumption, how can one know what the full dynamic range of the audio in the file was intended to be? (I could read the entire file and find out what the file's actual minimum and maximum sample values are, but there are two problems with that: (1) it would be a slow/expensive operation if the file is very large, and (2) it would lose information, in that if the file's creator had intended the file to have some "headroom" so as not to play at 0 dBFS at its loudest point, my program would not be able to detect that.)

By "normalized", do you mean "clamped" (to [-1,+1] in this case)? Normalization in a floating-point context usually refers to the normalization requirement for the significand/mantissa in IEEE-754 floating-point format. In fact, in those floating-point formats, data very small in magnitude is stored as denormalized numbers, and this can trigger massive slowdowns on some processors, unless such operands are flushed to zero. — njuffa, Apr 21 '15 at 04:22
.WAV and .AIFF merely specify container formats that can be used with numerous audio coding formats. It is not immediately clear that the data cannot exceed the range [-1,+1] across any of the supported audio coding formats. Some of the PCM fixed-point encodings would *appear* to be limited to that range. — njuffa, Apr 26 '15 at 01:00
njuffa any thoughts about the questions raised in the second paragraph? — Jeremy Friesner, Apr 26 '15 at 02:16
Sorry, I don't have any ideas. I am not even sure I understand what that second question is asking. — njuffa, Apr 26 '15 at 03:01
Imagine you wanted to author a sound file that is intended to be played at half the usual volume. If it was a signed-16-bit sound file, you could achieve that by scaling the file's sample values to fit inside the range [-16384,+16383], rather than the usual [-32768, +32767]. Or, if it was a floating point file and one could assume a possible-values-range of [-1.0, +1.0], you could achieve that effect by scaling the samples to fit inside [-0.5, +0.5]. But if you can't assume that [-1.0, +1.0] represents the full range, then how could the player know to play it at half-normal-volume? — Jeremy Friesner, Apr 26 '15 at 05:30
I'm voting to close this question as off-topic because it belongs on a different SE site. — Cole Tobin, Dec 26 '19 at 17:11

score 12 · Accepted Answer · edited Jun 20 '20 at 09:12

As you state, the publicly available documentation does not go into detail about the range used for floating point. However, from industry practice over the last several years, and from actual data existing as floating-point files, I would say it is a valid assumption.

There are practical reasons for this, and [-1, 1] is also a very common normalization range for high-precision data, whether color, audio, 3D etc.

The main reason for the range to be in the interval [-1, 1] is that it is fast and easy to scale/convert to the target bit-range. You only need to supply the target range and multiply.

For example:

If you want to play it at 16-bit you would do (pseudo, assuming signed rounded to integer result):

sample = in < 0 ? in * 0x8000 : in * 0x7fff;

or 24-bit:

sample = in < 0 ? in * 0x800000 : in * 0x7fffff;

or 8-bit:

sample = in < 0 ? in * 0x80 : in * 0x7f;

etc. without having to adjust the original input value in any way. -1 and 1 would represent min/max value when converted to target (1x = x).
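
As a rough illustration, the conversions above can be generalized over the bit depth in Python. This is only a sketch: the function name `float_to_int` is hypothetical, and the clamping step is an added assumption (the pseudocode above does not clamp), included because float files can hold values outside [-1, 1].

```python
def float_to_int(sample: float, bits: int = 16) -> int:
    """Convert a float sample, nominally in [-1.0, 1.0], to a signed integer."""
    # Assumption: clamp first, since float data may exceed [-1, 1].
    sample = max(-1.0, min(1.0, sample))
    # Asymmetric scaling as in the pseudocode above:
    # negatives scale by 2^(bits-1), positives by 2^(bits-1) - 1.
    neg_scale = 1 << (bits - 1)
    pos_scale = neg_scale - 1
    scale = neg_scale if sample < 0 else pos_scale
    return int(round(sample * scale))
```

With this sketch, `float_to_int(1.0)` gives 32767 and `float_to_int(-1.0)` gives -32768, so both ends of the float range map onto the ends of the 16-bit range.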

If you used a range of [-0.5, 0.5] you would first (or at some point) have to adjust the input value, so a conversion to, for example, 16-bit would need an extra step. This has an extra cost, not only for the step itself but also because it is done in the floating-point domain, which is heavier to compute (the latter is perhaps a legacy concern, as floating-point processing is pretty fast nowadays):

in = in * 2;
sample = in < 0 ? in * 0x8000 : in * 0x7fff;

Keeping it in the [-1, 1] range rather than some pre-scaled range (for example [-32768, 32767]) also allows the use of more bits for precision (using the IEEE 754 representation).

UPDATE 2017/07

Tests

Based on questions in comments I decided to triple-check by making a test using three files with a 1 second sine-wave:

A) Floating point clipped
B) Floating point max 0dB, and
C) integer clipped (converted from A)

The files were then scanned for values <= -1.0 and >= 1.0, starting after the data chunk ID and size field, so that the min/max values reflect the actual values found in the audio data.
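
A scan of this kind could look roughly like the Python sketch below. It is not the test script referred to in this answer. The function name `float_wav_min_max` is hypothetical, and the sketch assumes a little-endian RIFF/WAVE file held in memory, with plain format tag 3 (IEEE float) and 32-bit samples; WAVE_FORMAT_EXTENSIBLE files and empty data chunks are not handled.

```python
import struct

WAVE_FORMAT_IEEE_FLOAT = 3  # format tag for IEEE float in the 'fmt ' chunk


def float_wav_min_max(data: bytes):
    """Return (min, max) of the samples in a 32-bit float WAV held in memory."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    is_float32 = False
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        body = pos + 8  # chunk contents start after the ID and size field
        if chunk_id == b"fmt ":
            (fmt_tag,) = struct.unpack_from("<H", data, body)
            (bits,) = struct.unpack_from("<H", data, body + 14)
            is_float32 = fmt_tag == WAVE_FORMAT_IEEE_FLOAT and bits == 32
        elif chunk_id == b"data":
            if not is_float32:
                raise ValueError("expected 32-bit IEEE float samples")
            count = min(size, len(data) - body) // 4
            samples = struct.unpack_from(f"<{count}f", data, body)
            return min(samples), max(samples)
        pos = body + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no data chunk found")
```

Any result with a minimum below -1.0 or a maximum above 1.0 indicates samples beyond the nominal range.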

The results confirm that the range is indeed [-1, 1] inclusive when not clipping (<= 0 dB).

But it also revealed another aspect -

WAV files saved as floating point do allow values exceeding the 0 dB range. This means the range is actually beyond [-1, 1] for values that normally would clip.

The explanation for this may be that floating-point formats are intended for intermediate use in production setups, because they lose very little dynamic range: later processing (gain staging, compression, limiting etc.) can bring the values back, without loss, well within the final and normal -0.2 to 0 dB range. The format therefore preserves the values as-is.
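
To relate sample values to the dB figures used here, a small Python sketch may help. It assumes the convention used in this answer, where an absolute sample value of 1.0 corresponds to 0 dB; the helper names `peak_dbfs` and `db_to_gain` are hypothetical.

```python
import math


def peak_dbfs(peak: float) -> float:
    """Level in dB of an absolute peak value, with 1.0 taken as 0 dB."""
    if peak <= 0.0:
        return float("-inf")  # digital silence has no finite level
    return 20.0 * math.log10(peak)


def db_to_gain(db: float) -> float:
    """Linear gain factor corresponding to a level change in dB."""
    return 10.0 ** (db / 20.0)
```

Under this convention a peak of 2.0 sits at roughly +6 dB, and bringing a signal down to -0.2 dB means a peak of about 0.977.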

In conclusion

WAV files using floating point will store values in the [-1, 1] range when not clipping (<= 0 dB), but they do allow values that would be considered clipped.

But when converted to an integer format, these values will clip to the equivalent of the [-1, 1] range scaled by the bit-range of the integer format, regardless. This is natural given the limited range each bit width can hold.

It will therefore be up to the player/DAW/editing software to handle clipped floating-point values, by either normalizing the data or simply clipping back to [-1, 1].
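
The two strategies could be sketched in Python as follows. Both function names are hypothetical, and the choice to leave data untouched when its peak is already within range is an assumption of this sketch, not something the formats require.

```python
def hard_clip(samples):
    """Clip every sample back into [-1.0, 1.0]; overshoot is discarded."""
    return [max(-1.0, min(1.0, s)) for s in samples]


def peak_normalize(samples, target_peak=1.0):
    """Scale the whole signal down so its largest absolute value is target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= target_peak:
        return list(samples)  # already within range, leave untouched
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Clipping keeps the level of in-range samples but distorts the peaks, while normalizing keeps the waveform shape but lowers the overall level.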

file1
Notes: Max values for all files are measured directly from the sample data.

file2
Notes: Produced as clipped float (+6 dB), then converted to signed 16-bit and back to float.

file3
Notes: Clipped to +6 dB.

file4
Notes: Clipped to +12 dB.

Simple test script and files can be found here.

Thanks for posting this answer. Is the encodable value-range indeed `[-1, +1]`, or is it `[-1, +1)`? In other words: Is the value of `+1` itself included in the encoded range of values? *[It seems that this would require a different quantization-step for the positive-values range, i.e. for values > 0]* — Bliss, Jun 08 '17 at 14:57
Here it's inclusive [-1, +1], which is why you need two different scale values as shown (to be super-accurate at least). If super-accuracy isn't important you can of course use [-1, +1) and lose the full positive value of 1, using 0x7fff etc. for both signs. That being said, this is usually not a real-life problem though (I'm just picky) :) — , Jun 29 '17 at 18:38
Thanks much for your reply. Is this the formal-range which is **actually being used** for common audio-file formats (e.g. WAV)? { Meaning: **with** the `+1` included in the encodable-values range }. I couldn't find any formal documentation of this, and would have assumed that for simplicity & performance reasons, implementors of software/hardware encoders would ignore the `+1` value. Did you learn, from your experience, what is actually being done? — Bliss, Jul 02 '17 at 07:15
@Bliss I did some tests; added the results to the answer. The range is [0,1] and it turns out it actually goes beyond, but to keep the files clip-free (<= 0 dB when converted to, for example, integer) the absolute range is [0,1] inclusive. — , Jul 14 '17 at 23:22
I understood that (i.e. that [0, 1] refers to sample's *absolute* value). Thanks! — Bliss, Oct 15 '17 at 12:10
1) By multiplying negative and positive values by a different constant, you perform a non linear transformation (bad). 2) There is nothing special about the range [-1,+1] when it comes to conversion, if you want to convert a range [-0.5, +0.5] you simply multiply by a different (twice as large) constant. — user1146657, Aug 08 '18 at 12:20

score 3 · Answer 2 · answered Apr 28 '15 at 19:31

I know the question was not specific to a given programming language or framework, but I could not find the answer in any specification. What I can say for sure is that the NAudio library that is widely used to handle .WAV files in applications written for the .NET framework assumes that the float samples are in the range [-1.0,+1.0].

Here is the applicable code from its source code:

namespace NAudio.Wave
{
    public class WaveFileReader : WaveStream
    {
        ...
        /// <summary>
        /// Attempts to read the next sample or group of samples as floating point normalised into the range -1.0f to 1.0f
        /// </summary>
        /// <returns>An array of samples, 1 for mono, 2 for stereo etc. Null indicates end of file reached
        /// </returns>
        public float[] ReadNextSampleFrame()
        {
            ...
            var sampleFrame = new float[waveFormat.Channels];
            int bytesToRead = waveFormat.Channels*(waveFormat.BitsPerSample/8);
            ...
            for (int channel = 0; channel < waveFormat.Channels; channel++)
            {
                if (waveFormat.BitsPerSample == 16)
                ...
                else if (waveFormat.BitsPerSample == 32 && waveFormat.Encoding == WaveFormatEncoding.IeeeFloat)
                {
                    sampleFrame[channel] = BitConverter.ToSingle(raw, offset);
                    offset += 4;
                }
                ...
            }
            return sampleFrame;
        }
        ...
    }
}

So it just copies the float into the array without doing any transformations on it and promises it to be in the given range.
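
For comparison, the same pass-through behaviour can be sketched in Python. This is not NAudio's API: the function `read_float_frame` is a hypothetical stand-in, and it assumes little-endian 32-bit floats, as in a WAV data chunk.

```python
import struct


def read_float_frame(raw: bytes, offset: int, channels: int):
    """Decode one frame of little-endian 32-bit floats, one value per channel."""
    # The values are returned exactly as stored: nothing here clamps or
    # rescales them, so the [-1.0, 1.0] range is assumed, not enforced.
    return list(struct.unpack_from(f"<{channels}f", raw, offset))
```

A stored value of 2.0 would come back as 2.0, which shows that the documented range is a convention about the file contents rather than something the reader checks.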

score 1 · Answer 3 · answered Apr 30 '15 at 02:51

Yes.

Audio file formats act as carriers for one or more channels of audio data. That audio data has been encoded using a particular audio coding format. Each coding format uses an encoder algorithm. The algorithm is the important part. We can hand wave away the value of the file and coding formats.

AIFF and WAV both use Pulse-Code Modulation (PCM) or its descendants. (If you check out this Oracle doc, you'll notice that "Encoding/CompressionType" lists PCM-based algorithms.) PCM works by sampling the audio sine wave at fixed time intervals and choosing the nearest digital representation. The important point here is "sine wave".

Sine waves modulate between -1 and 1, thus all PCM-derived encodings will operate on this principle. Consider the mu-law implementation: notice in its defining equation the range is required to be -1 to 1.

I am doing a lot of hand-waving to answer this in brief. Sometimes we must necessarily lie to the kids. If you want to dig deeper into floating-point vs. fixed-point, importance of bit-depth to errors, etc. check out a good book on DSP. To get you started:
