I have a series of audio sources for which I am generating spectrograms. Each audio source has the same sampling rate, so performing the FFT on each audio sequence gives me matrices of equal height (rows) but varying width (columns), where the height spans the frequency bins and the width the time domain.
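
To illustrate the point above: with `scipy.signal.spectrogram` and its default `nperseg=256`, the number of frequency rows is fixed at `256 // 2 + 1 = 129` regardless of signal length, while the number of time columns grows with duration. A minimal check (using synthetic noise and a made-up `rate` in place of the real audio, and assuming scipy's defaults):

```python
import numpy as np
from scipy import signal

rate = 8000
short = np.random.randn(rate)      # 1 s of noise
long = np.random.randn(rate * 3)   # 3 s of noise

_, _, Sxx_short = signal.spectrogram(short, rate)
_, _, Sxx_long = signal.spectrogram(long, rate)

# Same height (frequency bins: nperseg // 2 + 1 = 129), different width.
print(Sxx_short.shape, Sxx_long.shape)
```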

Thus, I create a figure of height 1.0 (rows) and width col / row (columns) by normalising the dimensions and keeping the aspect ratio, as in the following example:

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from pydub import AudioSegment

if audio_source.endswith((".flac", ".wav")):
    raw = AudioSegment.from_file(audio_source)
else:
    raw = AudioSegment.from_mp3(audio_source)

# single channel
raw = raw.set_channels(1)

# downsampling the audio source
raw = raw.set_frame_rate(sampling_rate)

# retrieving data
data = raw.get_array_of_samples()

# data to numpy array
data = np.array(data)

# sample frequencies (f), segment times spanning the audio's length (t),
# and the spectrogram (Sxx, whose last axis spans the segment times)
f, t, Sxx = signal.spectrogram(data, sampling_rate)

# dimensions
row, col = Sxx.shape

# normalizing dimensions
row, col = 1.0, (col / row)

fig = plt.figure(figsize=(col, row), dpi=300)
plt.set_cmap('hot')

ax = fig.add_subplot(1, 1, 1, frameon=False)
ax.pcolormesh(t, f, Sxx)
ax.axes.get_xaxis().set_visible(False)
ax.axes.get_yaxis().set_visible(False)

plt.savefig(image_source, bbox_inches="tight", pad_inches=-0.1)
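
For comparison, here is a way to see what the pixel dimensions would be without the `bbox_inches="tight"` re-cropping: if the axes fill the whole figure and the canvas is saved as-is, the output is simply `figsize * dpi` pixels, so widths stay proportional to col / row. This is a sketch, not my actual pipeline: the noise signal and `sampling_rate = 8000` stand in for the decoded audio, and the PNG size is read straight from its IHDR header.

```python
import io
import struct

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from scipy import signal

# Synthetic stand-in for the decoded mono audio.
sampling_rate = 8000
data = np.random.randn(sampling_rate * 2)

f, t, Sxx = signal.spectrogram(data, sampling_rate)
row, col = Sxx.shape

fig = plt.figure(figsize=(col / row, 1.0), dpi=300)
# Axes spanning [0, 0, 1, 1] leave no whitespace to crop, so savefig
# without bbox_inches="tight" keeps the canvas at exactly figsize * dpi.
ax = fig.add_axes([0, 0, 1, 1], frameon=False)
ax.pcolormesh(t, f, Sxx)
ax.set_axis_off()

buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)

# Read width/height from the PNG IHDR chunk (bytes 16-24).
buf.seek(16)
width, height = struct.unpack(">II", buf.read(8))
print(width, height)
```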

I would expect the saved images to have the same height, which they do, but their widths are not proportional at all, or only roughly:

Audio Source 1 : (129, 190) -> (1.0, 1.4728) -> reality (388px, 247px) -> expectation (363px, 247px)
Audio Source 2 : (129, 59)  -> (1.0, 0.4573) -> reality (84px, 247px)
Audio Source 3 : (129, 121) -> (1.0, 0.9379) -> reality (228px, 247px)

388 / 247 -> 1.5708
84 / 247 -> 0.3400
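
To quantify the mismatch: given the saved height of 247 px, proportional widths would be each normalised ratio times 247. A quick check using the ratios and actual widths reported above:

```python
# Proportional widths implied by the saved height of 247 px,
# versus the widths actually produced (numbers from above).
ratios = [1.4728, 0.4573, 0.9379]
actual = [388, 84, 228]
for r, a in zip(ratios, actual):
    print(int(r * 247), a)
```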

What am I doing wrong?

DomainFlag