Calculating spectrogram of .wav files in python

Question

I am trying to calculate the spectrogram out of .wav files using Python. In an effort to do so, I am following the instructions that could be found in here. I am firstly read .wav files using librosa library. The code found in the link works properly. That code is:

sig, rate = librosa.load(file, sr = None)
sig = buf_to_int(sig, n_bytes=2)
spectrogram = sig2spec(rate, sig)

And the function sig2spec:

def sig2spec(signal, sample_rate):

# Read the file.
# sample_rate, signal = scipy.io.wavfile.read(filename)
# signal = signal[0:int(1.5 * sample_rate)]  # Keep the first 3.5 seconds
# plt.plot(signal)
# plt.show()

# Pre-emphasis step: Amplification of the high frequencies (HF)
# (1) balance the frequency spectrum since HF usually have smaller magnitudes compared to LF
# (2) avoid numerical problems during the Fourier transform operation and
# (3) may also improve the Signal-to-Noise Ratio (SNR).
pre_emphasis = 0.97
emphasized_signal = numpy.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
# plt.plot(emphasized_signal)
# plt.show()
# Consequently, we split the signal into short time windows. We can safely make the assumption that
# an audio signal is stationary over a small short period of time. Those windows size are balanced from the
# parameter called frame_size, while the overlap between consecutive windows is controlled from the
# variable frame_stride.

frame_size = 0.025
frame_stride = 0.01
frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate  # Convert from seconds to samples
signal_length = len(emphasized_signal)
frame_length = int(round(frame_length))
frame_step = int(round(frame_step))
num_frames = int(numpy.ceil(float(numpy.abs(signal_length - frame_length)) / frame_step))
# Make sure that we have at least 1 frame

pad_signal_length = num_frames * frame_step + frame_length
z = numpy.zeros((pad_signal_length - signal_length))
pad_signal = numpy.append(emphasized_signal, z)
# Pad Signal to make sure that all frames have equal
# number of samples without truncating any samples from the original signal

indices = numpy.tile(numpy.arange(0, frame_length), (num_frames, 1)) \
          + numpy.tile(numpy.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T

frames = pad_signal[indices.astype(numpy.int32, copy=False)]

# Apply hamming windows. The rationale behind that is  the assumption made by the FFT that the data
# is infinite and to reduce spectral leakage.
frames *= numpy.hamming(frame_length)

# Fourier-Transform and Power Spectrum
nfft = 2048
mag_frames = numpy.absolute(numpy.fft.rfft(frames, nfft))  # Magnitude of the FFT
pow_frames = ((1.0 / nfft) * (mag_frames ** 2))  # Power Spectrum

# Transform the FFT to MEL scale
nfilt = 40
low_freq_mel = 0
high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))  # Convert Hz to Mel
mel_points = numpy.linspace(low_freq_mel, high_freq_mel, nfilt + 2)  # Equally spaced in Mel scale
hz_points = (700 * (10 ** (mel_points / 2595) - 1))  # Convert Mel to Hz
bin = numpy.floor((nfft + 1) * hz_points / sample_rate)

fbank = numpy.zeros((nfilt, int(numpy.floor(nfft / 2 + 1))))
for m in range(1, nfilt + 1):
    f_m_minus = int(bin[m - 1])  # left
    f_m = int(bin[m])  # center
    f_m_plus = int(bin[m + 1])  # right

    for k in range(f_m_minus, f_m):
        fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
    for k in range(f_m, f_m_plus):
        fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
filter_banks = numpy.dot(pow_frames, fbank.T)
filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)  # Numerical Stability
filter_banks = 20 * numpy.log10(filter_banks)  # dB

return (filter_banks/ np.amax(filter_banks))*255

I can produce images that look like:

However, in some cases my spectrogram looks like:

Something really weird is happening since at the beginning of the signal there are some blue stripes in the images that I do not understand if they really mean something or there is an error when calculating the spectrogram. I guess the issue is related to normalization, but I am not sure what is exactly.

EDIT: I tried to use the recommended librosa from the library:

sig, rate = librosa.load("audio.wav", sr = None)
spectrogram = librosa.feature.melspectrogram(y=sig, sr=rate)
spec_shape = spectrogram.shape
fig = plt.figure(figsize=(spec_shape), dpi=5)
lidis.specshow(spectrogram.T, cmap=cm.jet)
plt.tight_layout()
plt.savefig("spec.jpg")

The spec now is almost everywhere dark blue:

Unless you're intent on implementing it yourself for some reason, I'd suggest using `scipy.signal.spectrogram`. — user2699, Feb 27 '19 at 14:35
@user2699 The OP tries to calculate mel-scaled spectrogram, so this function would not give the expected result. This [librosa function](https://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html) should. Can you see run the melspectrogram and see if you get similar results? — Lukasz Tracewski, Feb 27 '19 at 15:01
My issue with Librosa was the fact that I could not reproduce the same color palette. Thus, I skiped it. I can try it but preferable I would like to use the posted code. — Jose Ramon, Feb 27 '19 at 15:04
Any idea what it means that I am receiving those spectrograms? — Jose Ramon, Feb 27 '19 at 19:07
Did you try to ask here- https://dsp.stackexchange.com/ ?They might have better ideas — Shir, Mar 03 '19 at 09:03
There's an excellent free book on DSP with Python here... http://greenteapress.com/wp/think-dsp/ — Mark Setchell, Mar 03 '19 at 12:07
You are mixing up the order of `signal` and `rate`. You first load the audio using `sig, rate = librosa.load()` and then you call `spectrogram = sig2spec(rate, sig)`. But `sig2spec` is defined `sig2spec(signal, sample_rate)`, not `sig2spec(sample_rate, signal)`. In any case, posting a [mcve] incl. the audio may help, if that was not it. — Hendrik, Mar 07 '19 at 07:23
The confusion was a typo. I will include also the audio files. Still, I am not sure that I found a proper solution. — Jose Ramon, Mar 22 '19 at 10:38

zoulzubazz · Accepted Answer · 2019-03-06T14:06:55.157

It may be because you've not tweaked the parameters of the librosa melspectrogram method.

In your original implementation you specify nfft=2048. This could be passed to melspectrogram and you will see different results.

This article describes 'waveform frequency resolution' and 'fft resolution' which are important parameters when doing a FT. Understanding them may help reproducing your original spectograms.

http://www.bitweenie.com/listings/fft-zero-padding/

The specshow method has various parameters too which will directly affect the plots you're producing.

this stack posts lists the various spectogram parameters in MATLAB, but you can draw similarities between the MATLAB version and librosa version.

What is a spectrogram and how do I set its parameters?

I think the default value is the same 2048 in the librosa melspectrogram. — Jose Ramon, Mar 06 '19 at 14:37

Calculating spectrogram of .wav files in python

1 Answers1