Understanding audio file spectrogram values

Question

I am currently struggling to understand how the power spectrum is stored in the Kaldi framework.

I seem to have successfully created some data files using

$cmd JOB=1:$nj $logdir/spect_${name}.JOB.log \
    compute-spectrogram-feats --verbose=2 \
     scp,p:$logdir/wav_spect_${name}.JOB.scp ark:- \| \
    copy-feats --compress=$compress $write_num_frames_opt ark:- \
      ark,scp:$specto_dir/raw_spectogram_$name.JOB.ark,$specto_dir/raw_spectogram_$name.JOB.scp

This gives me a large file with data points for the different audio files, like this.

The problem is that I am not sure how I should interpret this data set. I know that an FFT is performed before this point, which I guess is a good thing.

The output example given above is from a file which is 1 second long.
All the default settings were used for computing the spectrogram, so the sample frequency should be 16 kHz, frame length = 25 ms and frame shift = 10 ms. The number of data points in the first set is 25186.

Given this information, can I interpret the output in some way?

Usually when one performs an FFT, the frequency bin size can be computed as F_s/N = bin_size, where F_s is the sample frequency and N is the FFT length. So is that the case here? 16000/25186 = 0.6... Hz/bin?

Or am I interpreting it incorrectly?

SleuthEye · Accepted Answer · 2017-01-16T03:08:17.257

> Usually when one performs an FFT, the frequency bin size can be computed as F_s/N = bin_size, where F_s is the sample frequency and N is the FFT length.
>
> So is that the case here? 16000/25186 = 0.6... Hz/bin?

The formula F_s/N is indeed what you would use to compute the frequency bin size. However, as you mention, N is the FFT length, not the total number of samples. Based on the approximate 25 ms frame length, the 10 ms hop size, and the fact that your generated output data file has 98 lines of 257 values for some presumably real-valued input, it would seem that the FFT length used was 512 (a real FFT of length 512 yields 512/2 + 1 = 257 values). This would give you a frequency bin size of 16000/512 = 31.25 Hz/bin.
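As a rough illustration of that arithmetic, here is a small Python sketch. The function name `rfft_bin_frequencies` is made up for this example, and the FFT length of 512 is the value inferred from the 257 columns, not something read from Kaldi's configuration.

```python
def rfft_bin_frequencies(sample_rate_hz, fft_length):
    """Return the frequency in Hz associated with each bin of a real FFT."""
    bin_width_hz = sample_rate_hz / fft_length  # F_s / N
    num_bins = fft_length // 2 + 1              # DC up to F_s/2 inclusive
    return [k * bin_width_hz for k in range(num_bins)]


# Assumed values: 16 kHz sampling rate, inferred FFT length of 512.
freqs = rfft_bin_frequencies(16000, 512)
print(len(freqs), freqs[1], freqs[-1])  # 257 31.25 8000.0
```

The first entry is the DC bin at 0 Hz, and the last one sits at the Nyquist frequency F_s/2.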

Based on this scaling, plotting your raw data with the following Matlab script (with the data previously loaded in the Z matrix):

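fs       = 16000;  % sampling rate in Hz (16 kHz)
hop_size = 0.010;  % time between successive frames in seconds (10 ms)
nfft     = 512;    % FFT length inferred from the 257 values per row
% Z holds one spectrum per row: rows are time frames, columns are frequency bins
[X,Y] = meshgrid((0:size(Z,1)-1)*hop_size, (0:size(Z,2)-1)*fs/nfft);
surf(X,Y,transpose(Z),'EdgeColor','none','FaceColor','interp');
view(2);  % look straight down for a 2-D spectrogram view
xlabel('Time (seconds)');
ylabel('Frequency (Hz)');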

gives this graph (the dark red regions are the areas of highest intensity):

edited Jan 16 '17 at 03:08

answered Jan 12 '17 at 02:55


The dataset only has 97 lines (not 98). The first line doesn't seem to belong to the dataset. – Carlton Banks Jan 12 '17 at 15:23
@CarltonBanks A real FFT on `N` samples will output `N/2 + 1` complex values, including 2 purely real values at the DC and `F_s/2` frequencies. For `N=512` that gives `512/2 + 1 = 257`. Also, the dataset file has 99 lines. Excluding the first line which doesn't contain data gives 98 rows of data. – SleuthEye Jan 12 '17 at 15:44
Sorry for asking so many questions, but could you perhaps elaborate on the number 98? I guess I understand why there are 257 entries, but I am not sure I understand why there are 98 lines. What does each line represent here? – I am not Fat Jan 12 '17 at 20:50
Each line represents the spectrum computed using the FFT on 1 block of input data. As such it shows the spectrum for that particular time slice. If the input data had been separated into disjoint blocks, then you would get 16000/512 ~ 31 such lines (which are essentially independent). Now since there is overlap in the FFT blocks you get more lines. Those lines are no longer independent but provide a slightly better time resolution (they could be seen as a kind of interpolation). When each block starts ~160 samples later than the previous you need (16000-512)/160 + 1 ~ 98 blocks to cover 1s – SleuthEye Jan 13 '17 at 00:58
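To make that count concrete, here is a small Python sketch. It assumes that only frames fitting entirely inside the signal are counted, and the helper name `count_full_frames` is made up for illustration. With the 25 ms frame length from the question (400 samples at 16 kHz, presumably zero-padded up to the 512-point FFT), the count comes out at exactly 98, while the 512-sample estimate gives about 97.8.

```python
def count_full_frames(num_samples, frame_length, hop_length):
    """Count the frames that fit entirely inside the signal."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // hop_length


fs = 16000                        # sampling rate in Hz
num_samples = fs * 1              # 1 second of audio
hop_length = fs * 10 // 1000      # 10 ms hop -> 160 samples
frame_length = fs * 25 // 1000    # 25 ms frame -> 400 samples (assumed window)

frames = count_full_frames(num_samples, frame_length, hop_length)
estimate = (num_samples - 512) / hop_length + 1  # estimate using the FFT length

print(frames)              # 98
print(round(estimate, 1))  # 97.8
```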
Well, isn't your time axis incorrect? I mean, would it not be a percentage rather than s as in seconds? `[0:97]/97 = [0 ... 1]` – I am not Fat Jan 13 '17 at 13:24
Your input covers 1 second, does it not? – SleuthEye Jan 13 '17 at 13:40
Do you have any idea what the unit of the Z matrix is? The dataset doesn't seem to be the log of the power spectrum. Yes, it covers 1 second, but if the Z matrix had 69 rows it would cover less than 1 second... With the way you've set up the axis, will the y-axis always end at 1? – I am not Fat Jan 13 '17 at 15:51
Why doesn't the plot cover 1 second fully? It seems to stop just before 1 s. – Carlton Banks Feb 14 '17 at 12:20
@CarltonBanks A block of 512 samples covers 0.032s, so because it cannot be represented as a single time instant I choose the start of the frame as the reference time for the frame to make it easier to plot. For the 98 blocks spaced 10ms apart, the start time of each frame goes from 0.0 to 0.97s (with the last frame finishing at 1.002s). – SleuthEye Feb 14 '17 at 16:57
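The frame times quoted here can be reproduced with a few lines of Python. This is only a sketch under the values used in this thread (160-sample hop, 512-sample block, 98 frames); the variable names are made up for illustration.

```python
fs = 16000          # sampling rate in Hz
hop_length = 160    # 10 ms between frame starts
block_length = 512  # FFT block length assumed in the answer
num_frames = 98     # rows of data in the output file

# Work in whole samples first, then convert to seconds.
start_samples = [i * hop_length for i in range(num_frames)]
start_times = [s / fs for s in start_samples]
last_end_time = (start_samples[-1] + block_length) / fs

print(start_times[0], start_times[-1], last_end_time)  # 0.0 0.97 1.002
```

Using the frame start as the reference is why the plotted time axis stops at 0.97 s even though the input lasts 1 second.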
The plot shows that this isn't a logged power spectrum. Would it be possible to use Kaldi to extract the dataset for the logged power spectrum? – Carlton Banks Mar 13 '17 at 09:52
@CarltonBanks I haven't attempted to do that. But short of any other built-in means, you could tweak Kaldi to create your own log-scale power spectrum computer inspired by the [`MfccComputer`](http://kaldi-asr.org/doc/classkaldi_1_1MfccComputer.html), which would use [`MelBanks`](http://kaldi-asr.org/doc/classkaldi_1_1MelBanks.html) with appropriately chosen center frequencies (following a log-scale instead of the standard mel-scale). – SleuthEye Mar 14 '17 at 01:40
