Looking at this answer: Python Scipy FFT wav files
The technical part is obvious and working, but I have three theoretical questions (the code mentioned is below):
1) Why do I have to normalize (b = ...) the frames? What would happen if I used the raw data?
2) Why should I only use half of the FFT result (d = ...)?
3) Why should I take abs(c) of the FFT result? (A small numeric check related to 2) and 3) is below.)
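To make 2) and 3) concrete, here is a small check I did in plain numpy (just my own experiment, not the code from the answer): for a real-valued signal the FFT is conjugate-symmetric, so the second half mirrors the first, and abs() turns each complex coefficient into its magnitude.

import numpy as np

x = np.random.rand(8)          # real-valued samples, like audio frames
X = np.fft.fft(x)              # complex spectrum

# the second half is the complex conjugate of the first half (mirrored)
print(np.allclose(X[1:], np.conj(X[:0:-1])))  # True
print(np.abs(X))                              # magnitudes are symmetric around the middle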
Perhaps I'm missing something due to an inadequate understanding of the WAV format or the FFT, but while this code works just fine, I'd be glad to understand why it works and how to make the best use of it.
Edit: in response to the comment by @Trilarion:
I'm trying to write a simple, not 100% accurate, proof-of-concept Speaker Diarisation in Python. That means taking a wav file (right now I am using this one for my tests) and, for each second (or any other resolution), saying whether the speaker is person #1 or person #2. I know in advance that there are 2 speakers, and I am not trying to link them to any known voice signatures, just to separate them. Right now I take each second, FFT it (and thus get a list of frequencies), and cluster the results using KMeans with the number of clusters between 2 and 4 (A, B [, Silence [, A+B]]).
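As a rough sketch of that idea (this is just how I picture it, assuming numpy and scikit-learn's KMeans, and clustering directly on the per-second magnitude spectra, which is probably naive):

import numpy as np
from scipy.io import wavfile
from sklearn.cluster import KMeans

fs, data = wavfile.read('test.wav')
mono = data.T[0]                          # first channel of the stereo file
n_seconds = len(mono) // fs               # drop the trailing partial second

features = []
for i in range(n_seconds):
    frame = mono[i * fs:(i + 1) * fs]     # one second of samples
    features.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum of that second

labels = KMeans(n_clusters=2).fit_predict(np.array(features))
print(labels)                             # one cluster label per second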
I'm still new to analyzing wav files and audio in general.
import matplotlib.pyplot as plt
from scipy.io import wavfile # get the api
import scipy.fftpack as sfft # fft routines
fs, data = wavfile.read('test.wav') # load the data
a = data.T[0] # this is a two channel soundtrack, I get the first track
b = [(ele/2**8.)*2 - 1 for ele in a] # this is an 8-bit track, b is now normalized on [-1,1)
c = sfft.fft(b) # calculate the fourier transform (a list of complex numbers)
d = len(c)//2  # you only need half of the fft list (integer division, so it can be used as a slice index)
plt.plot(abs(c[:(d-1)]), 'r')
plt.show()
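For reference, this is a variation I tried so the x-axis is in Hz rather than bin index (my own addition, assuming numpy; rfft already returns only the non-redundant half of the spectrum):

import numpy as np

freqs = np.fft.rfftfreq(len(b), d=1.0/fs)      # frequency in Hz for each bin
plt.plot(freqs, np.abs(np.fft.rfft(b)), 'r')   # one-sided magnitude spectrum
plt.show()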