tl;dr: I've got two audio recordings of the same song without timestamps, and I'd like to align them. I believe FFT is the way to go, but while I've come a long way, it feels like I'm right on the edge of understanding enough to make it work, and I would greatly benefit from some "you got this part wrong" advice on FFT. (My education never got into this area.) So I came here seeking ELI5 help.
The journey:
1. Get two recordings at the same sample rate. (done!)
2. Transform them into a waveform (a `DoubleArray`). This doesn't keep any of the meta info like "samples/second", but the FFT math doesn't care until later.
3. Run an FFT on them using a simplified implementation for beginners.
4. Get an `Array<Frame>`, where each `Frame` contains an `Array<Bin>` and each `Bin` has `(amplitude, frequency)`, because the older implementation hid all the details (like frame width, and number of bins, and... stuff?) and outputs words I'm familiar with like "amplitude" and "frequency".
5. Try moving to a more robust FFT (Apache Commons).
6. Get an output of "real" and "imaginary". (uh oh)
7. Make the totally incorrect assumption that those were the same thing (amplitude and frequency). Surprise, they aren't!
8. Learn that Apache's FFT returns an `Array<Complex>`, which means it... er... is just one frame's worth? So should I be chopping the song into one-second chunks and passing each one into the FFT, calling it multiple times? That seems strange; how does it get lower frequencies? (My guess at that chunking is sketched just below.)
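Here's my guess at what that chunking would look like. This is just a sketch, not something I've verified against real audio; `framedFft` and the `frameSize` of 4096 are a name and a number I made up, not anything from Apache Commons:

```kotlin
import org.apache.commons.math3.complex.Complex
import org.apache.commons.math3.transform.DftNormalization
import org.apache.commons.math3.transform.FastFourierTransformer
import org.apache.commons.math3.transform.TransformType

// Hypothetical helper: chop the waveform into fixed-size, power-of-2 frames
// and FFT each one, giving one Array<Complex> of bins per frame.
fun framedFft(waveform: DoubleArray, frameSize: Int = 4096): List<Array<Complex>> {
    require(Integer.bitCount(frameSize) == 1) { "Apache's FFT needs a power-of-2 length" }
    val fft = FastFourierTransformer(DftNormalization.STANDARD)
    return (0..waveform.size - frameSize step frameSize).map { start ->
        fft.transform(waveform.copyOfRange(start, start + frameSize), TransformType.FORWARD)
    }
}
```

If the frequency-from-index formula in my code further down is right, the bins of each frame are spaced sampleRate / frameSize apart (44100 / 4096 ≈ 10.8 Hz), so I suspect the frame size, not the number of chunks, is what sets the lowest frequency a single frame can resolve.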
To the best of my understanding, the complex number is a way to convey the phase shift and amplitude in one neat container (and you need phase shift if you want to do the FFT in reverse). And the frequency is calculated from the index of the array.
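Written out, here's what I think each bin $z_k$ of an $N$-sample FFT at sample rate $f_s$ gives you:

$$\text{amplitude}_k = \sqrt{\operatorname{Re}(z_k)^2 + \operatorname{Im}(z_k)^2}, \qquad \text{phase}_k = \operatorname{atan2}\big(\operatorname{Im}(z_k), \operatorname{Re}(z_k)\big), \qquad f_k = \frac{k \cdot f_s}{N}$$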
Which works out to (pseudocode in Kotlin)
```kotlin
import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder
import javax.sound.sampled.AudioFormat
import javax.sound.sampled.AudioInputStream
import kotlin.math.atan2
import kotlin.math.hypot
import org.apache.commons.math3.complex.Complex
import org.apache.commons.math3.transform.DftNormalization
import org.apache.commons.math3.transform.FastFourierTransformer
import org.apache.commons.math3.transform.TransformType

// Round n up to the next power of two; Apache's FFT only accepts power-of-2 lengths.
fun findNextPowerOf2(n: Int): Int =
    Integer.highestOneBit(n).let { if (it == n) n else it shl 1 }

fun main() {
    val audioFile = File("Dream_On.pcm")
    val (phases, frequencies, amplitudes) = AudioInputStream(
        audioFile.inputStream(),
        AudioFormat(
            /* encoding = */ AudioFormat.Encoding.PCM_SIGNED,
            /* sampleRate = */ 44100f,
            /* sampleSizeInBits = */ 16,
            /* channels = */ 2,
            /* frameSize = */ 4,
            /* frameRate = */ 44100f,
            /* bigEndian = */ false
        ),
        audioFile.length() / /* frameSize */ 4
    ).use { ais ->
        // Decode the raw 16-bit little-endian PCM into doubles.
        // NOTE: with channels = 2 this leaves left/right samples interleaved in one array.
        val bytes = ais.readAllBytes()
        val shorts = ShortArray(bytes.size / 2)
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(shorts)
        val allWaveform = DoubleArray(shorts.size) { i -> shorts[i].toDouble() }

        // Grab a power-of-2 window (65536 samples, about 1.5 s) from the middle of the song.
        val halfwayThroughSong = allWaveform.size / 2
        val moreThanOneSecond = allWaveform.copyOfRange(
            halfwayThroughSong,
            halfwayThroughSong + findNextPowerOf2(44100)
        )

        val fft = FastFourierTransformer(DftNormalization.STANDARD)
        val fftResult: Array<Complex> = fft.transform(moreThanOneSecond, TransformType.FORWARD)
        println("fftResult size: ${fftResult.size}")

        // Keep only the first half of the bins: for real-valued input,
        // the second half of the spectrum mirrors the first.
        val phases = DoubleArray(fftResult.size / 2)
        val amplitudes = DoubleArray(fftResult.size / 2)
        val frequencies = DoubleArray(fftResult.size / 2)
        fftResult.take(fftResult.size / 2).forEachIndexed { idx, complex ->
            phases[idx] = atan2(complex.imaginary, complex.real)
            frequencies[idx] = idx * 44100.0 / fftResult.size
            amplitudes[idx] = hypot(complex.real, complex.imaginary)
        }
        Triple(phases, frequencies, amplitudes)
    }
}
```
Is my step 8 at all close to the truth? Why would the FFT result return an array as big as my number of input samples? That makes me think I've got the "window" or "frame" part wrong.
I read up on