Is this part of a real IFFT process really optimal?

Question

When calculating (I)FFT it is possible to calculate "N*2 real" data points using a ordinary complex (I)FFT of N data points. Not sure about my terminology here, but this is how I've read it described. There are several posts about this on stackoverflow already.

This can speed things up a bit when only dealing with such "real" data which is often the case when dealing with for example sound (re-)synthesis.

This increase in speed is offset by the need for a pre-processing step that somehow... uhh... fidaddles? the data to achieve this. Look I'm not even going to try to convince anyone I fully understand this but thanks to previously mentioned threads, I came up with the following routine, which does the job nicely (thank you!).

However, on my microcontroller this costs a bit more than I'd like even though trigonometric functions are already optimized with LUTs.

But the routine itself just looks like it should be possible to optimize mathematically to minimize processing. To me it seems similar to plain 2d rotation. I just can't quite wrap my head around it, but it just feels like this could be done with fewer both trigonometric calls and arithmetic operations.

I was hoping perhaps someone else might easily see what I don't and provide some insight into how this math may be simplified.

This particular routine is for use with IFFT, before the bit-reversal stage.

pseudo-version:

  INPUT  
  MAG_A/B = 0 TO 1
  PHA_A/B = 0 TO 2PI
  INDEX   = 0 TO PI/2

  r = MAG_A * sin(PHA_A)
  i = MAG_B * sin(PHA_B)
  rsum = r + i
  rdif = r - i
  r = MAG_A * cos(PHA_A)
  i = MAG_B * cos(PHA_B)
  isum = r + i
  idif = r - i
  r = -cos(INDEX)
  i = -sin(INDEX)
  rtmp = r * isum + i * rdif
  itmp = i * isum - r * rdif
  OUTPUT rsum + rtmp
  OUTPUT itmp + idif
  OUTPUT rsum - rtmp
  OUTPUT itmp - idif

original working code, if that's your poison:

  void fft_nz_set(fft_complex_t complex[], unsigned bits, unsigned index, int32_t mag_lo, int32_t pha_lo, int32_t mag_hi, int32_t pha_hi) {
    unsigned size = 1 << bits;
    unsigned shift = SINE_TABLE_BITS - (bits - 1);
    unsigned n = index;        // index for mag_lo, pha_lo
    unsigned z = size - index; // index for mag_hi, pha_hi
    int32_t rsum, rdif, isum, idif, r, i;
    r = smmulr(mag_lo,   sine(pha_lo)); // mag_lo * sin(pha_lo)
    i = smmulr(mag_hi,   sine(pha_hi)); // mag_hi * sin(pha_hi)
    rsum = r + i; rdif = r - i;
    r = smmulr(mag_lo, cosine(pha_lo)); // mag_lo * cos(pha_lo)
    i = smmulr(mag_hi, cosine(pha_hi)); // mag_hi * cos(pha_hi)
    isum = r + i; idif = r - i;
    r = -sinetable[(1 << SINE_BITS) - (index << shift)]; // cos(pi_c * (index / size) / 2)
    i = -sinetable[index << shift];                      // sin(pi_c * (index / size) / 2)
    int32_t rtmp = smmlar(r, isum, smmulr(i, rdif)) << 1; // r * isum + i * rdif
    int32_t itmp = smmlsr(i, isum, smmulr(r, rdif)) << 1; // i * isum - r * rdif
    complex[n].r = rsum + rtmp;
    complex[n].i = itmp + idif;
    complex[z].r = rsum - rtmp;
    complex[z].i = itmp - idif;
  }

  // For reference, this would be used as follows to generate a sawtooth (after IFFT)

  void synth_sawtooth(fft_complex_t *complex, unsigned fft_bits) {
    unsigned fft_size = 1 << fft_bits;
    fft_sym_dc(complex, 0, 0); // sets dc bin [0]
    for(unsigned n = 1, z = fft_size - 1; n <= fft_size >> 1; n++, z--) {
      // calculation of amplitude/index (sawtooth) for both n and z
      fft_sym_magnitude(complex, fft_bits, n, 0x4000000 / n, 0x4000000 / z);
    }
  }

1. you should measure what is slow on your implementation (on MCU it is hard but doable use Osciloscope and some test pin toggle) then try to improve the bottleneck... 2. if your FFT input data is at constant size then you can precompute the whole `W` matrix instead of `sin,cos` LUT or use sin,cos tables with step compatible with your data size. Here mine DFFT implementation in C++ with this technique http://stackoverflow.com/a/24322189/2521214 3. you did not specified platform,language ... 4. on most todays MCU's is DMA present may be you can exploit it for data transfer or butterfly ... — Spektre, Nov 18 '14 at 07:06
Note that it may even be possible to perform `sin,cos` precomputations at compile time (see eg. [this Dr. Dobb's article](http://www.drdobbs.com/cpp/a-simple-and-efficient-fft-implementatio/199500857) for an implementation in C++). — SleuthEye, Nov 18 '14 at 13:58
The entire function is highly optimized for what it does. Notice inline intrinsics and quoting myself: "trigonometric functions are already optimized with LUTs.", they are inline and translate to 2 cpu cycles + 1 memory read. There are still improvements that can be done by hand optimizing the assembly, but this is not the issue. The function is profiled (professionally, not by scoping pins) and has no substantial bottlenecks. I am well versed in optimization, but weak on math. This is where I was requesting assistance. I believe the math of this routine can be simplified quite a bit. — Ronald McFüglethorn, Nov 18 '14 at 14:53
Then you might want to have a look at the [Split radix algorithm on wikipedia](http://en.wikipedia.org/wiki/Split-radix_FFT_algorithm), and a [variant from FFTW](http://www.fftw.org/newsplit.pdf). — SleuthEye, Nov 19 '14 at 11:49
@RonaldMcFüglethorn if I see it right PHA_A,PHA_B is just for sampling point phase correction why not skip it inside recursion and change the whole array before/after (I)FFT ? that would spare 4 mul's,2 sin's, 2 cos's ... (but I do not use this technique much (only for DCT) so I could be wrong) — Spektre, Nov 20 '14 at 08:22

Is this part of a real IFFT process really optimal?

0 Answers0