
Apologies if the answer is already on the site somewhere (I couldn't find it). I'm a hobbyist trying to load a WAV file, get its magnitude and phase data (for modification), generate a spectrogram, and then save it back as a new WAV file. I'm using C++ (Qt) and the FFTW library.

My problem is that the resulting WAV differs from the original even when no modifications are made. If the FFT operations are performed on the whole sample sequence in one go, the result looks just like the original. But I have to use STFTs with overlapping windows, and in that case I get distortions resulting in periodic crackling/throttling sounds, and the waveform of the audio is significantly changed.

This can be seen in the following examples (viewed in Audacity):

original / processed in one chunk: [screenshot: original]

processed (windowSize=2048, hopSize=1024, no window function): [screenshot: processed, ws=2048, hs=1024, wf=none]

I can't post more examples with my reputation, but applying a Hamming window function after the ISTFT (not before the STFT), combined with the method I use to merge the resulting windowed samples, gives good sound. The waveform is still quite different though; mainly a significant loss in the peaks is observed.

I think the way I combine the ISTFT results into the new sample sequence is the problem. What is the proper way to do this? An example in C++ would be really appreciated.

EDIT: As correctly pointed out by SleuthEye, I made a mistake in the code. The code below is adjusted. The waveform and sound now seem to be perfect even without applying a window function. Still, is the method correct for such an operation?
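
For context, the mistake SleuthEye points out in the comments below was presumably passing the magnitude instead of the phase as the angle argument of cos/sin when rebuilding the complex spectrum, i.e. something like:

fftWindow[i][0] = magnitudes[currentWindow][i] * cos( magnitudes[currentWindow][i] ); // wrong: magnitude used as the angle
fftWindow[i][1] = magnitudes[currentWindow][i] * sin( magnitudes[currentWindow][i] ); // wrong: magnitude used as the angle

The corrected lines appear in the code below.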

Here's the relevant source:

// getSampleNormalized(n) returns sample n of 1 channel in -1.0 to 1.0 range
// getSampleCount()      returns sample count of 1 channel
// quint32 is just unsigned int

quint32 windowSize     = 2048;
quint32 windowSizeHalf = windowSize / 2 + 1;
quint32 slideWindowBy  = 1024; // hopSize
quint32 windowCount    = getSampleCount() / slideWindowBy;
if ( (windowCount * slideWindowBy) < getSampleCount()){
    windowCount += 1;
}
quint32 newSampleCount = windowCount * slideWindowBy + ( windowSize - slideWindowBy );

double *window          = new double[windowSize];
fftw_complex *fftResult = new fftw_complex[windowSizeHalf];
fftw_complex *fftWindow = new fftw_complex[windowSizeHalf];
double *result          = new double[windowSize];

double **magnitudes    = new double*[windowCount];
double **phases        = new double*[windowCount];
double **signalWindows = new double*[windowCount];
for (int i = 0; i < windowCount; ++i){
    magnitudes[i]    = new double[windowSizeHalf];
    phases[i]        = new double[windowSizeHalf];
    signalWindows[i] = new double[windowSize];
}

double *sampleSignals = new double[newSampleCount];

fftw_plan fftPlan  = fftw_plan_dft_r2c_1d( windowSize, window,    fftResult, FFTW_ESTIMATE );
fftw_plan ifftPlan = fftw_plan_dft_c2r_1d( windowSize, fftWindow, result,    FFTW_ESTIMATE );

// STFT
for ( int currentWindow = 0; currentWindow < windowCount; ++currentWindow ){
    for (int i = 0; i < windowSize; ++i){
        quint32 currentSample = currentWindow * slideWindowBy + i;
        if ( ( currentSample ) < getSampleCount() ){
            window[i] = getSampleNormalized( currentSample ); // * ( windowHamming( i, windowSize ) );
        }
        else{
            window[i] = 0.0;
        }
    }

    fftw_execute(fftPlan);

    for (int i = 0; i < windowSizeHalf; ++i){
        magnitudes[currentWindow][i] = sqrt( fftResult[i][0]*fftResult[i][0] + fftResult[i][1]*fftResult[i][1] );
        phases[currentWindow][i]     = atan2( fftResult[i][1], fftResult[i][0] );
    }
}

// INVERSE STFT
for ( int currentWindow = 0; currentWindow < windowCount; ++currentWindow ){
    for ( int i = 0; i < windowSizeHalf; ++i ){
        fftWindow[i][0] = magnitudes[currentWindow][i] * cos( phases[currentWindow][i] );  // Real
        fftWindow[i][1] = magnitudes[currentWindow][i] * sin( phases[currentWindow][i] );  // Imaginary
    }

    fftw_execute(ifftPlan);

    for ( int i = 0; i < windowSize; ++i ){
        signalWindows[currentWindow][i] = result[i] / windowSize;            // getting normalized result
        //signalWindows[currentWindow][i] *= (windowHamming( i, windowSize )); // applying Hamming window function
    }
}

quint32 pos;

// HERE WE COMBINE RESULTED WINDOWS

// COMBINE AND AVERAGE
// 1st window should be full replace
for ( int i = 0; i < windowSize; ++i ){
    sampleSignals[i] = signalWindows[0][i];
}
// 2nd window and onwards: combine with previous ones
for ( int currentWindow = 1; currentWindow < windowCount; ++currentWindow ){
    // combine and average with data from previous window
    for ( int i = 0; i < (windowSize - slideWindowBy); ++i ){
        pos = currentWindow * slideWindowBy + i;
        sampleSignals[pos] = (sampleSignals[pos] + signalWindows[currentWindow][i]) * 0.5;
    }
    // simply replace for the rest
    for ( int i = (windowSize - slideWindowBy); i < windowSize; ++i ){
        pos = currentWindow * slideWindowBy + i;
        sampleSignals[pos] = signalWindows[currentWindow][i];
    }
}

// then just save the wav file...
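
For completeness, the commented-out windowHamming( i, windowSize ) calls above would correspond to a helper along these lines; this is only a sketch using the standard Hamming formula, since the actual implementation isn't shown in the question:

// Standard Hamming window coefficient for sample i of an N-sample window:
// w(i) = 0.54 - 0.46 * cos( 2*pi*i / (N-1) )
double windowHamming( quint32 i, quint32 windowSize ){
    const double pi = 3.14159265358979323846;
    return 0.54 - 0.46 * cos( ( 2.0 * pi * i ) / ( windowSize - 1 ) );
}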
  • I assume you meant to use `phase[currentWindow][i]` (instead of `magnitude`) as the argument of the `cos` & `sin` when constructing `fftWindow`... – SleuthEye Jul 26 '16 at 03:29
  • I need to slap myself :) You're absolutely right! The waveform looks a lot better now. What do you think about the method I use to add up the results? – brachna Jul 26 '16 at 03:36
  • Is your intention to eventually do some processing in the time–frequency-domain (attenuation, amplification, etc.), and then convert back to audio? If so, I think overlapping windows will be *really* hard to get right. For example, consider an 8-point vector that you chop up into 3 chunks of 4 points each, with 2-point overlap (so, 0..3, 2..5, and 4..7 indexes)—if you FFT each chunk, perform arbitrary transforms on each of the 3 chunks’ FFTs, then IFFT, there’s just no sane way to recombine them. Overlapping-window STFT & “ISTFT” make sense only if no frequency-domain operations are done. – Ahmed Fasih Jul 26 '16 at 03:48
  • But if you don’t do any processing, and just IFFT each chunk’s FFT, then of course you’ll recover a chunked & overlapped version of the original—you don’t even need to average, just discard one side of the overlap. (Here, “chunk” means a “short-time” subsection of the full data.) – Ahmed Fasih Jul 26 '16 at 03:55
  • Yes, that was my intention. Will be sad if it's not at all feasible. – brachna Jul 26 '16 at 03:57
  • So averaging the way I do is fine? – brachna Jul 26 '16 at 03:58
  • I'm reading https://ccrma.stanford.edu/~jos/parshl/Choice_Hop_Size.html (kinda technical, unfortunately) via http://stackoverflow.com/a/6891772/500207 (STFT/ISTFT in Numpy/Python), trying to answer that question (about averaging). Do you absolutely have to have overlap? – Ahmed Fasih Jul 26 '16 at 04:19
  • Yes, I think so. The reason is that I want to make a spectrogram data model (that I want to manipulate), and the sounds I'll be working with will be about 1 to 3 seconds tops. So without overlap I'd get a very thin spectrogram. Thank you for trying to help :) – brachna Jul 26 '16 at 13:34
  • I’m so sorry, I am crazy. Even if you have zero overlap, if you manipulate the STFT array and then inverse-STFT, you will still have discontinuous jumps between chunks. Here’s how I might approach the problem, which should work in the overlapping case too. Linearly interpolate the overlapping samples. The earliest sample in overlap should be entirely from the first chunk, and the last sample in overlap from the second chunk. In between, linearly weight the samples from first & second chunks to reconstruct something in-between. – Ahmed Fasih Jul 27 '16 at 01:15
  • I think this will work better than averaging because, consider the case where you have lots of overlap, so the same input sample appears in several chunks. You apply some kind of time-dependent frequency manipulation. If you obtain an output sample by averaging all the chunks that contain it, then you could be muddling the time-dependence of the frequency filtering. With interpolation, the modifications you make to each time-chunk’s spectrum should come out much sharper. Does that make sense? I can draw a picture/write some Python code to illustrate. – Ahmed Fasih Jul 27 '16 at 01:18
  • No, I understand. Indeed linear interpolation between overlapping data seems like a better solution than just plain averaging. Thank you for clear explanation. – brachna Jul 27 '16 at 01:43
  • when you figure it out, please write it up as an answer and accept it! Also consider writing a blog post because the idea is actually really, really interesting. I'd love to see a real-time webapp that records some audio, and lets me draw some kind of time–frequency filter/attenuation/amplification, and then hear the result – Ahmed Fasih Jul 27 '16 at 02:13
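
For reference, here's a rough sketch of the linear-interpolation (cross-fade) recombination discussed in the comments above, using the same variables as the code in the question (signalWindows, sampleSignals, windowSize, slideWindowBy, windowCount). The overlap region is blended with a weight that ramps from the previous window to the current one; this is an untested illustration of the idea, not a verified implementation:

// COMBINE WITH A LINEAR CROSS-FADE INSTEAD OF PLAIN AVERAGING
// 1st window: full replace
for ( quint32 i = 0; i < windowSize; ++i ){
    sampleSignals[i] = signalWindows[0][i];
}
// 2nd window and onwards: cross-fade over the overlapping region
for ( quint32 currentWindow = 1; currentWindow < windowCount; ++currentWindow ){
    quint32 overlap = windowSize - slideWindowBy;
    // overlapping part: weight ramps from 0 (all previous window) to 1 (all current window)
    for ( quint32 i = 0; i < overlap; ++i ){
        quint32 p = currentWindow * slideWindowBy + i;
        double w = ( overlap > 1 ) ? (double)i / (double)( overlap - 1 ) : 1.0;
        sampleSignals[p] = ( 1.0 - w ) * sampleSignals[p] + w * signalWindows[currentWindow][i];
    }
    // non-overlapping part: take the current window as-is
    for ( quint32 i = overlap; i < windowSize; ++i ){
        quint32 p = currentWindow * slideWindowBy + i;
        sampleSignals[p] = signalWindows[currentWindow][i];
    }
}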
