Convert FFT to PCM

Question

I have some FFT data: 257 dimensions per frame, one frame every 10 ms, 121 frames in total, i.e. 1.21 seconds. I think the first dimension is probably something else and the remaining ones are the FFT coefficients. It's probably just spectrogram data. According to a comment about the FFT data, sqrt10 and mean-variance normalization might have been applied to it.

From there, I want to compute a PCM signal at 44.1 kHz so I can play the sound. I asked the same question in a more mathematical way here, and about the theory here on DSP SE, but maybe StackOverflow is a better place because I actually want to implement this.

How would I do that? Maybe I need some more information (which I would have to find out somehow), but which? Maybe this missing information can be intelligently guessed somehow?

This question is about both the theory and the practical implementation. The implementation is trivial, I guess, but a concrete example in some language would help me understand the theory. Maybe C++ with FFTW? I skimmed through the FFTW docs but I fail to understand all the terminology and some of the background, e.g. here. Why is it from complex to real or the other way around? I only want real to real. What are those REDFT? What's a DCT, DFT, DST? FFTW_HC2R?

I read all the FFT data, i.e. 121 * 257 floats, into a vector freq_bins.

std::vector<float32_t> freq_bins; // FFT data
int freq_bins_count = 257;
size_t len = 121;

std::vector<float32_t> pcm; // output, PCM data

int N = freq_bins_count;
std::vector<double> out(N), orig_in(N);

// offset, scale, dt, sampleRate and sample_t are defined elsewhere (not shown here)
// inspiration: https://stackoverflow.com/questions/2459295/invertible-stft-and-istft-in-python/6891772#6891772
for(int f = 0; f < len; ++f) {
    size_t pos = freq_bins_count * f;
    for(int i = 0; i < N; ++i)
        out[i] = pow(freq_bins[pos + i] + offset, 10);  // fft was sqrt10 + mvn
    fftw_plan q = fftw_plan_r2r_1d(N, &out[0], &orig_in[0], FFTW_REDFT00, FFTW_ESTIMATE);
    fftw_execute(q);
    fftw_destroy_plan(q);

    // naive overlap-and-add
    auto start_frame = size_t(f * dt * sampleRate);
    for(int i = 0; i < N; ++i) {
        sample_t frame = orig_in[i] * scale / (2 * (N - 1));
        size_t idx = start_frame + i;
        while(idx >= pcm.size())
            pcm.push_back(0);
        pcm[idx] += frame;
    }
}

But this is wrong, I guess. I just get garbage out.

Related might be this question. Or this.

Given that you're talking about implementation (rather than theory), and also mentioning libraries in comments below, you should tag this question with the language you intend to use... — Oliver Charlesworth, Aug 16 '15 at 09:31
@OliverCharlesworth: This is about both, or even more about the theory. The implementation is trivial I guess. But a concrete example in some language would be nice to help understanding the theory. Maybe C++ with FFTW? I skipped through the FFTW docs but I fail to understand all the terminology and some background, e.g. [here](http://www.fftw.org/fftw3_doc/One_002dDimensional-DFTs-of-Real-Data.html#One_002dDimensional-DFTs-of-Real-Data). Why is it from complex to real or the other way, I only want real to real. What are those REDFT? What's a DCT, DFT, DST? Etc. — Albert, Aug 16 '15 at 09:37
If the question is about theory, then http://dsp.stackexchange.com is probably your best bet. — Oliver Charlesworth, Aug 16 '15 at 09:40

score 2 · Answer 1 · edited May 23 '17 at 11:54

If the data you have is real, then it is most probably spectrogram data; if it is complex, then you most probably have raw short-time Fourier transform (STFT) data (see the diagram on this post to see how STFT/spectrogram data is produced). Spectrogram data is produced by taking the magnitude squared of the STFT data and is thus not invertible, because all the phase information in the audio signal has been lost. Raw STFT data is invertible, so if that is what you have, you might want to look for a library that performs the inverse STFT and try using that.
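To make the point about lost phase concrete, here is a minimal sketch (not taken from the answer) that compares the magnitude spectra of two different signals. It uses a naive O(N^2) DFT so that it needs no library; `naive_dft` and `max_magnitude_difference` are hypothetical helper names. It only looks at a single frame, whereas a real spectrogram is built from many overlapping frames.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) DFT of a real signal. Fine for a demonstration,
// far too slow for real audio; a library FFT would be used instead.
std::vector<std::complex<double>> naive_dft(const std::vector<double>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = -2.0 * pi * double(k) * double(t) / double(n);
            sum += x[t] * std::polar(1.0, angle);
        }
        X[k] = sum;
    }
    return X;
}

// Largest difference between the magnitude spectra of two signals
// of equal length. A value near zero means the magnitudes alone
// cannot tell the two signals apart.
double max_magnitude_difference(const std::vector<double>& a,
                                const std::vector<double>& b) {
    assert(a.size() == b.size());
    const auto A = naive_dft(a);
    const auto B = naive_dft(b);
    double worst = 0.0;
    for (std::size_t k = 0; k < A.size(); ++k)
        worst = std::max(worst, std::abs(std::abs(A[k]) - std::abs(B[k])));
    return worst;
}
```

For example, a 16-sample frame with a single click at sample 0 and another with the click at sample 5 give a difference that is zero up to rounding, although the two frames are clearly different signals. The delay lives entirely in the phase.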

As for the question of what the FFT dimensions in your data represent, I reckon the 257 data points you are receiving every 10 ms are the result of a 512-point FFT being used in the STFT process. The first sample is the 0 Hz frequency and the remaining 256 data points are one half of the FFT spectrum (the other half of the FFT data has been discarded because the input to the FFT is real, so one half of the FFT data is simply the complex conjugate of the other half).
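The bin arithmetic behind that guess can be sketched as follows. This is only an illustration: the function names are hypothetical, the FFT length is assumed to be even, and the 44.1 kHz sample rate in the usage note is the asker's target rate, not a known property of the data.

```cpp
#include <cassert>
#include <cstddef>

// Number of non-redundant bins of a real-input FFT of even length:
// bins 0 .. fft_size/2 inclusive (DC up to the Nyquist frequency).
std::size_t unique_bins(std::size_t fft_size) {
    assert(fft_size % 2 == 0);
    return fft_size / 2 + 1;
}

// FFT length implied by a given bin count, assuming an even FFT length.
std::size_t fft_size_from_bins(std::size_t bins) {
    assert(bins >= 2);
    return 2 * (bins - 1);
}

// Centre frequency in Hz of bin k for a given FFT length and sample rate.
double bin_frequency_hz(std::size_t k, std::size_t fft_size,
                        double sample_rate_hz) {
    assert(fft_size > 0);
    return double(k) * sample_rate_hz / double(fft_size);
}
```

With these, `fft_size_from_bins(257)` gives 512, and under the assumed 44.1 kHz rate bin 0 sits at 0 Hz while bin 256 sits at 22050 Hz, the Nyquist frequency.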

In addition to this, I would like to point out that just because you are receiving FFT data every 10 ms, 121 times, does not mean the audio signal is 1.21 s long. The STFT is usually produced using overlapping windows, so your audio signal might be shorter than 1.21 s.
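As a rough sketch of how frame count and duration relate, assume the 10 ms is the hop between frame starts and that the signal was not padded. Both are assumptions, and the window length is not known from the question, so `stft_duration_seconds` is only a hypothetical helper for trying out values.

```cpp
#include <cassert>
#include <cstddef>

// Length in seconds of the audio covered by an STFT, assuming frames
// start every hop_s seconds, each frame spans window_s seconds, and
// no padding was added at either end.
double stft_duration_seconds(std::size_t frames, double hop_s,
                             double window_s) {
    assert(frames > 0);
    return double(frames - 1) * hop_s + window_s;
}
```

Under these assumptions, 121 frames with a 10 ms hop cover 1.2 s plus one window length, so the result comes out below 1.21 s for windows shorter than the hop and above it for longer ones.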

answered Aug 16 '15 at 01:39 by KillaKem · edited May 23 '17 at 11:54 by Community

I only have those 257 dimensions. Even if I cannot reproduce the true signal, can I somehow reproduce some signal which would produce the same FFT data? – Albert Aug 16 '15 at 09:40
Put simply, if you have the raw STFT data (i.e. the frequency data is a matrix of complex numbers representing the audio) then you can invert the data to get your audio signal back, but if you have spectrogram data (i.e. the frequency data is a matrix of real numbers) then you will not be able to invert it or even get a signal that sounds close to the original, because all the phase information has been thrown away. For more on inverting STFTs see this: http://eeweb.poly.edu/iselesni/EL713/STFT/stft_inverse.pdf – KillaKem Aug 16 '15 at 11:31
I just have those 257 dimensions of real data (or maybe they are 128 complex + 1 real? But they all look alike, so I guess 257 real makes more sense). So I guess that is the spectrogram (how exactly do you get that from the raw STFT? sqrt(abs(fft)) or so?). Can't I recreate some phase data somehow? If I just assume 0 or anything, would I get back the same spectrogram? If not, can I guess the phase data somehow so that I get the same spectrogram? Or are you saying that two sounds with the same spectrogram can sound totally different? Why is that? That is very counterintuitive to me. – Albert Aug 16 '15 at 11:40
No, you can't guess the phase data, and if you set the phase to 0 for all samples in the spectrogram you will most likely get garbage audio out after you perform an inverse STFT; it won't sound anything like the original audio even though the spectrogram produced by both will be the same. Your only hope of recovering the audio is if the data you have is the full STFT data and not just the spectrogram. – KillaKem Aug 27 '15 at 16:22
I would also like to point out that just because the 2D array you have looks real, in that all its elements are real numbers, doesn't necessarily mean that the data represented by it is real. Some libraries (e.g. FFTW) have options to represent complex data arrays as real ones by first listing the real parts of the array and then listing the imaginary parts, e.g. see this: http://www.fftw.org/doc/The-Halfcomplex_002dformat-DFT.html – KillaKem Aug 27 '15 at 16:31
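For illustration, a sketch of unpacking such a layout into complex bins. It follows the halfcomplex ordering described in the FFTW manual: the real parts r0, r1, ..., r(n/2) come first, followed by the imaginary parts in reverse order, ending with i1. `unpack_halfcomplex` is a hypothetical helper, not an FFTW function, and whether the asker's data uses this layout at all is unknown.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Convert an n-element halfcomplex array
//   r0, r1, ..., r(n/2), i((n+1)/2 - 1), ..., i2, i1
// into n/2 + 1 complex bins.
std::vector<std::complex<double>> unpack_halfcomplex(
        const std::vector<double>& hc) {
    const std::size_t n = hc.size();
    assert(n >= 1);
    std::vector<std::complex<double>> bins(n / 2 + 1);
    for (std::size_t k = 0; k <= n / 2; ++k) {
        // Bin 0, and bin n/2 when n is even, are purely real,
        // so no imaginary part is stored for them.
        const bool has_imag = (k > 0) && (2 * k < n);
        const double imag = has_imag ? hc[n - k] : 0.0;
        bins[k] = std::complex<double>(hc[k], imag);
    }
    return bins;
}
```

Note that in this layout an n-point transform yields n real numbers, so 257 real values per frame would not match a 512-point transform stored this way.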

score 0 · Answer 2 · answered Aug 15 '15 at 18:38

You'd simply push the data you have through the inverse Fourier transform. All FFT libraries offer forward and backward transformation functions.
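A minimal sketch of what the inverse transform does for one frame, assuming the frame is available as N/2 + 1 complex bins of an even-length real-input DFT. `inverse_real_dft` is a hypothetical, naive O(N^2) helper written out for clarity; in practice a library routine (for example FFTW's complex-to-real plans, whose output is unnormalized) would do this step. It presupposes complex bins, so it does not apply directly to magnitude-only data, and the resulting frames would still have to be overlap-added.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Reconstruct n = 2 * (bins.size() - 1) real samples from the
// non-redundant half of a spectrum (bins 0 .. n/2).
std::vector<double> inverse_real_dft(
        const std::vector<std::complex<double>>& bins) {
    assert(bins.size() >= 2);
    const std::size_t half = bins.size() - 1;
    const std::size_t n = 2 * half;
    const double pi = std::acos(-1.0);
    std::vector<double> x(n);
    for (std::size_t t = 0; t < n; ++t) {
        // DC and Nyquist bins occur once in the full spectrum;
        // the Nyquist term alternates in sign from sample to sample.
        double sum = bins[0].real()
                   + bins[half].real() * ((t % 2 == 0) ? 1.0 : -1.0);
        for (std::size_t k = 1; k < half; ++k) {
            const double angle = 2.0 * pi * double(k) * double(t) / double(n);
            // Bin k and its mirrored conjugate (bin n - k) together
            // contribute twice the real part of X[k] * e^{j*angle}.
            sum += 2.0 * (bins[k] * std::polar(1.0, angle)).real();
        }
        x[t] = sum / double(n);  // normalize by the transform length
    }
    return x;
}
```

As a check, the 4-point signal 1, 2, 3, 4 has the half spectrum 10, -2 + 2j, -2, and passing those three bins to `inverse_real_dft` returns the original four samples up to rounding.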

answered Aug 15 '15 at 18:38 by datenwolf

Could you give some more information? Currently I don't use such a library. Which one would I use and how exactly would I get my PCM data? Maybe you can give some example code for some library? – Albert Aug 15 '15 at 18:39
