Can't figure out how to compute spectral properties of a .wav file

Question

I am trying to figure out the mean frequency and the Q25 and Q75 values for a sound clip, but am running into issues (mainly due to my lack of mathematical and DSP knowledge).

I'm going off of this answer, and am running into issues combining the code in that answer with reading a .wav file.

Here is the code I'm using to record...

def record_sample(file):
    # Audio Recording
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 2
    RATE = 44100
    RECORD_SECONDS = 5
    pa = pyaudio.PyAudio()

    # Record sample.
    stream = pa.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK)
    frames = []
    for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)
    stream.stop_stream()
    stream.close()

    # Save to wave file.
    wf = wave.open(file, "wb")
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(pa.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

That all works fine. Here is the code I have for computing mean frequency, Q25, and Q75...

def spectral_properties(file):
    # Note: scipy.io.wavfile.read
    fs, data = wavfile.read(file)
    spec = np.abs(np.fft.rfft(data))
    freq = np.fft.rfftfreq(len(data), d=1 / fs)
    spec = np.abs(spec)
    amp = spec / spec.sum()
    amp_cumsum = np.cumsum(amp)
    Q25 = freq[len(amp_cumsum[amp_cumsum <= 0.25]) + 1]
    Q75 = freq[len(amp_cumsum[amp_cumsum <= 0.75]) + 1]
    print((freq * amp).sum(), Q25, Q75)

And the error that it is producing...

File "/home/horner/workspace/school/ML/machine-learning-project-mdx97/program/audio.py", line 65, in spectral_properties
    Q75 = freq[len(amp_cumsum[amp_cumsum <= 0.75]) + 1]
IndexError: index 298981 is out of bounds for axis 0 with size 110081

`len(amp_cumsum[amp_cumsum <= 0.75]) + 1` is `298981` while the length of the array `freq` is only `110081`. You're asking for an index which doesn't exist in `freq`. — kevinkayaks, Apr 16 '19 at 21:25
yes. so what about how I'm reading data from the wavfile is causing this? The answer I linked gives the code for calculating all the spectral properties, but doesn't really explain what format my audio data needs to be in when I do the first fft. — mdx97, Apr 16 '19 at 21:50

blubberdiblub · Answer 1 · 2019-04-17T05:28:48.993

Note that you have 2 channels, which means you're getting a 2-dimensional data. Your current version just flattens the array during some of the operations, which is why it appears to have too many elements.

There are 2 ways to solve the problem. The first is to use just one of the channels:

def spectral_properties(filename):

    fs, data = wavfile.read(filename)

    # use the first channel only
    if data.ndim > 1:
        data = data[:, 0]

    spec = np.abs(np.fft.rfft(data))
    freq = np.fft.rfftfreq(len(data), d=1/fs)

    assert len(spec) == len(freq)

    amp = spec / spec.sum()
    amp_cumsum = amp.cumsum()

    assert len(amp_cumsum) == len(freq)

    q25 = freq[len(amp_cumsum[amp_cumsum < 0.25])]
    q75 = freq[len(amp_cumsum[amp_cumsum < 0.75])]

    return (freq * amp).sum(), q25, q75


avg, q25, q75 = spectral_properties('foobar.wav')
print(avg, q25, q75)

The second one is to keep the channels and to tell the numpy functions along which axis they are supposed to operate. This also means that calculating the quartiles gets less trivial, as you need to find them separately for each channel, but it looks nearly as easy as before due to Python's list comprehensions:

def spectral_properties(filename):

    fs, data = wavfile.read(filename)

    # determine number of channels
    num_channels = data.shape[1]

    spec = np.abs(np.fft.rfft(data, axis=0))
    freq = np.fft.rfftfreq(len(data), d=1/fs)

    assert len(spec) == len(freq)

    amp = spec / spec.sum(axis=0)
    amp_cumsum = amp.cumsum(axis=0)

    assert len(amp_cumsum) == len(freq)

    q25 = [freq[len(amp_cumsum[:,j][amp_cumsum[:,j] < 0.25])] for j in range(num_channels)]
    q75 = [freq[len(amp_cumsum[:,j][amp_cumsum[:,j] < 0.75])] for j in range(num_channels)]

    return (freq[:,np.newaxis] * amp).sum(axis=0), q25, q75


avg, q25, q75 = spectral_properties('foobar.wav')
print(avg, q25, q75)

Note that the + 1 was problematic in your original expressions for finding the quartiles. Consider all values being less than 0.25 apart from the last. So the inequality would be true for n - 1 elements. You add 1, so you get n. But n is too high an index for the freq array of length n.

Also, I suspect you may want to square spec instead of leaving it at the magnitudes.

Update:

You may also want to use searchsorted to find the quartiles, which should be faster and easier to read:

q25 = freq[np.searchsorted(amp_cumsum, 0.25)]
q75 = freq[np.searchsorted(amp_cumsum, 0.75)]

And:

q25 = [freq[np.searchsorted(amp_cumsum[:,j], 0.25)] for j in range(num_channels)]
q75 = [freq[np.searchsorted(amp_cumsum[:,j], 0.75)] for j in range(num_channels)]

Thank you! I seem to be getting odd values when I run this script on the sound files from the first file in [this](http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/8kHz_16bit/) repository. The first file contains male speaking clips, which I would expect to have a mean frequency of somewhere around the 100(Hz)s... But, this code is giving me a mean frequency of ~600hz on these samples (which I believe is far above the range for human voice). I have tried both squaring and not squaring `spec` like you mentioned, with high values being produced both ways. — mdx97, Apr 17 '19 at 17:34
If it helps, I'm comparing the values I'm getting to [this](https://www.kaggle.com/primaryobjects/voicegender) dataset, which I believe utilized the R function [specan](https://www.rdocumentation.org/packages/warbleR/versions/1.1.2/topics/specan). — mdx97, Apr 17 '19 at 17:36
Well, I suspect there's more to it than just doing an FFT over the full audio. The documentation to the R `specan` function suggests that they're using a sliding window with a Hanning shape and also do other processing. A raw Fourier transform is meant for periodic data. If you apply it to unprepared non-periodic data, you will at least get boundary effects (start and end of audio don't match up smoothly) and a longer sample usually doesn't have a constant fundamental frequency over time. — blubberdiblub, Apr 18 '19 at 01:56
Oh, now I see where your `+ 1` came from. **R** uses 1-based indexing (more friendly to mathematicians, I suppose, but less friendly to programmers) while Numpy and Python in general use 0-based indexing. You need to take this difference into account and adjust stuff you port over from R accordingly. — blubberdiblub, Apr 18 '19 at 02:20

Can't figure out how to compute spectral properties of a .wav file

1 Answers1