tl;dr: I've got two audio recordings of the same song without timestamps, and I'd like to align them. I believe FFT is the way to go, but while I've come a long way, it feels like I'm right on the edge of understanding enough to make it work, and I would greatly benefit from some "you got this part wrong" advice on FFT. (My education never got into this area.) So I came here seeking ELI5 help.
The journey:
1. Get two recordings at the same sample rate. (done!)
2. Transform them into a waveform (a `DoubleArray`). This doesn't keep any of the meta info like "samples/second", but the FFT math doesn't care until later.
3. Run an FFT on them using a simplified implementation for beginners.
4. Get an `Array<Frame>`, where each `Frame` contains an `Array<Bin>` and each `Bin` has `(amplitude, frequency)`, because the older implementation hid all the details (like frame width, and number of bins, and... stuff?) and outputs words I'm familiar with like "amplitude" and "frequency".
5. Try moving to a more robust FFT (Apache Commons).
6. Get an output of "real" and "imaginary". (uh oh)
7. Make the totally incorrect assumption that those were the same thing (amplitude and frequency). Surprise, they aren't!
8. Learn that Apache's FFT returns an `Array<Complex>`, which means it... er... is just one frame's worth? So should I be chopping the song into one-second chunks and passing each one into the FFT, calling it multiple times? That seems strange; how does it get lower frequencies? (My guess at that chunking is sketched just below.)
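Here's my guess at what that chunking would look like. This is just a sketch, not something I've verified against real audio; `framedFft` and the `frameSize` of 4096 are a name and a number I made up, not anything from Apache Commons:

```kotlin
import org.apache.commons.math3.complex.Complex
import org.apache.commons.math3.transform.DftNormalization
import org.apache.commons.math3.transform.FastFourierTransformer
import org.apache.commons.math3.transform.TransformType

// Hypothetical helper: chop the waveform into fixed-size, power-of-2 frames
// and FFT each one, giving one Array<Complex> of bins per frame.
fun framedFft(waveform: DoubleArray, frameSize: Int = 4096): List<Array<Complex>> {
    require(Integer.bitCount(frameSize) == 1) { "Apache's FFT needs a power-of-2 length" }
    val fft = FastFourierTransformer(DftNormalization.STANDARD)
    return (0..waveform.size - frameSize step frameSize).map { start ->
        fft.transform(waveform.copyOfRange(start, start + frameSize), TransformType.FORWARD)
    }
}
```

If the frequency-from-index formula in my code further down is right, the bins of each frame are spaced sampleRate / frameSize apart (44100 / 4096 ≈ 10.8 Hz), so I suspect the frame size, not the number of chunks, is what sets the lowest frequency a single frame can resolve.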
To the best of my understanding, the complex number is a way to convey the phase shift and amplitude in one neat container (and you need phase shift if you want to do the FFT in reverse). And the frequency is calculated from the index of the array.
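Written out, here's what I think each bin $z_k$ of an $N$-sample FFT at sample rate $f_s$ gives you:

$$\text{amplitude}_k = \sqrt{\operatorname{Re}(z_k)^2 + \operatorname{Im}(z_k)^2}, \qquad \text{phase}_k = \operatorname{atan2}\big(\operatorname{Im}(z_k), \operatorname{Re}(z_k)\big), \qquad f_k = \frac{k \cdot f_s}{N}$$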
Which works out to (pseudocode in Kotlin)
```kotlin
import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder
import javax.sound.sampled.AudioFormat
import javax.sound.sampled.AudioInputStream
import kotlin.math.atan2
import kotlin.math.hypot
import org.apache.commons.math3.complex.Complex
import org.apache.commons.math3.transform.DftNormalization
import org.apache.commons.math3.transform.FastFourierTransformer
import org.apache.commons.math3.transform.TransformType

// Round n up to the next power of two; Apache's FFT only accepts power-of-2 lengths.
fun findNextPowerOf2(n: Int): Int =
    Integer.highestOneBit(n).let { if (it == n) n else it shl 1 }

fun main() {
    val audioFile = File("Dream_On.pcm")
    val (phases, frequencies, amplitudes) = AudioInputStream(
        audioFile.inputStream(),
        AudioFormat(
            /* encoding = */ AudioFormat.Encoding.PCM_SIGNED,
            /* sampleRate = */ 44100f,
            /* sampleSizeInBits = */ 16,
            /* channels = */ 2,
            /* frameSize = */ 4,
            /* frameRate = */ 44100f,
            /* bigEndian = */ false
        ),
        audioFile.length() / /* frameSize */ 4
    ).use { ais ->
        // Decode the raw 16-bit little-endian PCM into doubles.
        // NOTE: with channels = 2 this leaves left/right samples interleaved in one array.
        val bytes = ais.readAllBytes()
        val shorts = ShortArray(bytes.size / 2)
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(shorts)
        val allWaveform = DoubleArray(shorts.size) { i -> shorts[i].toDouble() }

        // Grab a power-of-2 window (65536 samples, about 1.5 s) from the middle of the song.
        val halfwayThroughSong = allWaveform.size / 2
        val moreThanOneSecond = allWaveform.copyOfRange(
            halfwayThroughSong,
            halfwayThroughSong + findNextPowerOf2(44100)
        )

        val fft = FastFourierTransformer(DftNormalization.STANDARD)
        val fftResult: Array<Complex> = fft.transform(moreThanOneSecond, TransformType.FORWARD)
        println("fftResult size: ${fftResult.size}")

        // Keep only the first half of the bins: for real-valued input,
        // the second half of the spectrum mirrors the first.
        val phases = DoubleArray(fftResult.size / 2)
        val amplitudes = DoubleArray(fftResult.size / 2)
        val frequencies = DoubleArray(fftResult.size / 2)
        fftResult.take(fftResult.size / 2).forEachIndexed { idx, complex ->
            phases[idx] = atan2(complex.imaginary, complex.real)
            frequencies[idx] = idx * 44100.0 / fftResult.size
            amplitudes[idx] = hypot(complex.real, complex.imaginary)
        }
        Triple(phases, frequencies, amplitudes)
    }
}
```
Is my step 8 at all close to the truth? Why would the FFT result return an array as big as my number of input samples? That makes me think I've got the "window" or "frame" part wrong.
I read up on