
Looking at this answer: Python Scipy FFT wav files

The technical part works as advertised, but I have three theoretical questions (the code in question is below):

1) Why do I have to normalize (b=...) the frames? What would happen if I used the raw data?

2) Why should I only use half of the FFT result (d=...)?

3) Why should I abs(c) the FFT result?

Perhaps I'm missing something due to inadequate understanding of WAV format or FFT, but while this code works just fine, I'd be glad to understand why it works and how to make the best use of it.

Edit: in response to the comment by @Trilarion:

I'm trying to write a simple, not 100% accurate, more proof-of-concept Speaker Diarisation in Python. That means taking a wav file (right now I am using this one for my tests) and saying, for each second (or any other resolution), whether the speaker is person #1 or person #2. I know in advance that there are 2 persons, and I am not trying to link them to any known voice signatures, just to separate them. Right now I take each second, FFT it (and thus get a list of frequencies), and cluster the results using KMeans with the number of clusters between 2 and 4 (A, B [,Silence [,A+B]]).
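A minimal numpy sketch of the per-second feature extraction step described above (not the asker's actual code; the sample rate and the synthetic sine standing in for speech are assumptions, and the KMeans step is left out):

```python
import numpy as np

# Assumed sample rate and a synthetic stand-in for recorded speech.
fs = 8000
seconds = 4
t = np.arange(fs * seconds) / fs
signal = np.sin(2 * np.pi * 220 * t)

# Slice into one-second frames and FFT each frame, producing one
# magnitude vector per second to feed into a clusterer such as KMeans.
frames = signal.reshape(seconds, fs)            # one row per second
features = np.abs(np.fft.rfft(frames, axis=1))  # rfft: real input, half spectrum

print(features.shape)  # (4, 4001): 4 seconds, fs/2 + 1 frequency bins each
```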

I'm still new to analyzing wav files and audio in general.

import matplotlib.pyplot as plt
import scipy.fftpack as sfft # needed for sfft.fft below
from scipy.io import wavfile # get the api
fs, data = wavfile.read('test.wav') # load the data
a = data.T[0] # this is a two channel soundtrack, I get the first track
b = [(ele/2.0**8)*2 - 1 for ele in a] # this is an 8-bit track, b is now normalized on [-1,1)
c = sfft.fft(b) # create a list of complex numbers
d = len(c)//2 # you only need half of the fft list (integer division so it can be used as a slice index)
plt.plot(abs(c[:(d-1)]),'r')
plt.show()
Guy Rapaport
    For starters, you can read [this](http://mathworks.com/help/matlab/math/fast-fourier-transform-fft.html). – mkrieger1 Jul 27 '15 at 20:46
  • As for (2): Looks like the original answer cuts off the negative frequency terms and uses only the positive frequency terms. For an audio signal those should be redundant. – dhke Jul 27 '15 at 20:49
  • Please make a real question out of it. Why you should do something obviously depends on what you want to achieve. As it is, this question is unclear and therefore not useful except to you. The answers are very generous in explaining the knowledge behind Fourier transforms, but they can never answer why you should do it. – NoDataDumpNoContribution Jul 27 '15 at 21:05
    @Trilarion on the contrary, this question comes down to the nature of FFT itself and the answers would be quite useful to anybody dabbling in it for the first time. My only concern is that it might have already been answered elsewhere on the site. – Mark Ransom Jul 27 '15 at 21:44
  • @MarkRansom Sure, FFTs are interesting. But this question is not very helpful. At least now we know what the asker wants to achieve. Speech is obviously a real-valued signal. I'm sure there are variants that compute the FFT of real-valued signals where you do not have to throw away half of the output, because only half of it is calculated in the first place. In short, I prefer clearer, more precise questions. If one wants to know more about the true nature of the FFT, then one should ask exactly that. The better one asks, the more helpful the question and answers will be for everyone. – NoDataDumpNoContribution Jul 30 '15 at 12:23

2 Answers


To address these in order:

1) You don't need to normalize, but without normalization the values stay close to the raw structure of the digitized waveform, so the numbers are unintuitive. For example, how loud is a value of 67? It's easier to normalize the data into the range -1 to 1 so the values can be interpreted. (But if you wanted to implement a filter, for example, where you did an FFT, modified the FFT values, followed by an IFFT, normalizing would be an unnecessary hassle.)
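A small sketch of what that normalization does (the sample values here are made up): 8-bit wav samples are unsigned integers in [0, 255], and the list comprehension in the question just rescales them onto [-1, 1).

```python
import numpy as np

# Made-up 8-bit samples standing in for one channel of the wav data.
raw = np.array([0, 64, 128, 255], dtype=np.uint8)

# Same formula as b=... in the question: map [0, 256) onto [-1, 1).
b = (raw / 2.0**8) * 2 - 1

print(b)  # 0 -> -1.0, 64 -> -0.5, 128 -> 0.0, 255 -> 0.9921875
```

Because the FFT is linear, skipping this step only scales and offsets the spectrum; the shape of the magnitude plot stays the same (apart from the DC bin picking up the offset).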

2) and 3) are similar in that they both have to do with the math living primarily in the complex-number space. That is, an FFT takes a sequence of complex numbers (e.g., [.5+.1j, .4+.7j, .4+.6j, ...]) to another sequence of complex numbers.

So in detail:

2) It turns out that if the input waveform is real instead of complex, then the FFT has a conjugate symmetry about 0, so only the values that have a frequency >= 0 are uniquely interesting.
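That symmetry is easy to check numerically; a sketch with a random real signal (the signal itself is arbitrary):

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(16)  # real-valued signal
X = np.fft.fft(x)

# Conjugate symmetry of a real signal's FFT: X[k] == conj(X[N-k]).
assert np.allclose(X[1:], np.conj(X[:0:-1]))

# numpy's rfft computes only the unique non-negative-frequency half.
half = np.fft.rfft(x)
assert np.allclose(half, X[:9])  # N/2 + 1 = 9 unique bins for N = 16
```

This is also the variant alluded to in the comments: np.fft.rfft never computes the redundant half in the first place.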

3) The values output by the FFT are complex, so they have real and imaginary parts, but this can also be expressed as a magnitude and phase. For audio signals, it's usually the magnitude that's the most interesting, because this is primarily what we hear. Therefore people often use abs (which gives the magnitude), but the phase can be important for other problems as well.
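A tiny sketch of the magnitude/phase split, using one full cycle of a sine as the input (a toy signal, not the asker's data):

```python
import numpy as np

c = np.fft.fft([0.0, 1.0, 0.0, -1.0])  # one cycle of a sine, N = 4

mag = np.abs(c)      # what abs(c) gives you: per-bin magnitude
phase = np.angle(c)  # the part abs() discards

# The energy sits in bins 1 and 3, the +/- frequency pair: [0, 2, 0, 2].
print(mag)
```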

tom10

That depends on what you're trying to do. It looks like you only want to plot the spectral density, and for that purpose it's fine.

In general each DFT coefficient depends on the phase of its frequency component, so if you want to keep phase information you have to keep the argument of the complex numbers, not just their magnitude.

The symmetry you see is only guaranteed if the input is a real-numbered sequence (IIRC). It's related to the mirroring distortion you get when there are frequencies above the Nyquist frequency (half the sampling frequency): the original frequency shows up in the DFT, but so does its mirrored counterpart.

If you're going to apply an inverse DFT, you should keep the full data, including the arguments (phases) of the DFT coefficients.
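A quick sketch of why (toy data, values chosen arbitrarily): keeping the full complex spectrum lets the inverse DFT reconstruct the signal exactly, which would be impossible from the magnitudes alone.

```python
import numpy as np

x = np.array([0.3, -0.1, 0.8, 0.5])   # arbitrary real signal
c = np.fft.fft(x)                     # full complex spectrum

x_back = np.fft.ifft(c).real          # imaginary parts are numerical noise

assert np.allclose(x, x_back)         # round trip is exact
```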

skyking