Bit reverse reorder optimization using Intel intrinsics

Question

A while ago, I optimized a radix2 function using SSE intrinsics and I'm nearly close to FFTW performance, So, my next step is to optimize the bit reverse reorder function that I found with the original code, to be honest, and I want to optimize it. The original code is like so:

void bit_reverse_reorder(float *x,float *y, int N)
{
   int bits=0;
   int i, j, k;
   float tempr, tempi;
   //MAXPOW = 12

   for (i=0; i<MAXPOW; i++)
      if (pow_2[i]==N) bits=i;

   for (i=0; i<N; i++)
   {
      j=0;
      for (k=0; k<bits; k++)
         if (i&pow_2[k]) j+=pow_2[bits-k-1];
      if (j>i)
      {
        tempr=x[i];
        tempi=y[i];
        x[i]=x[j];
        y[i]=y[j];
        x[j]=tempr;
        y[j]=tempi;
      }
   }
}

int main()
{
   radix2(re,im,N);

   bit_reverse_reorder(re,im,N);
}

PS: pow_2[] is a precomputed array containing the powers of 2 (1,2,4,8,16,32,...), N is the number of element = 4096, *x and *y represents ,respectively, the real and imaginary parts of each element of the input data.

The radix2 generate results that are not ordered so the stated function reorder the result.

First of all, I did not fully understand how this bit reversal works! So, I think it would be good if someone gave me hints about how this function works.

Second, I intend to use SSE intrinsics to enhance this function performance so is there any vector of 2 instructions that could be used in the swapping loop?

What is the purpose of bit_reverse_reorder()? The name implies it's moving bits around, but the code is swapping elements in arrays of floats. I is confused... — lyst, Nov 10 '16 at 14:49
I edited the question to be more-clear and for your first question i admit that the function looks like a swap function and not bit reversal — A.nechi, Nov 10 '16 at 14:58
I could be wrong, but I think the `bit_reverse_reorder` function is doing the [butterfly reordering](https://en.wikipedia.org/wiki/Butterfly_diagram) step of the FFT. If you don't understand that step of the FFT well, it's going to be hard to optimize this function. — Cornstalks, Nov 10 '16 at 15:29
Related: [In-place bit-reversed shuffle on an array](http://stackoverflow.com/questions/932079/in-place-bit-reversed-shuffle-on-an-array) — nwellnhof, Nov 10 '16 at 16:12
The bit reversal re-ordering is most likely a fairly insignificant part of your execution profile - have you actually measured/profiled it to see whether it is worth optimising further ? Remember that the FFT itself is `O(n log n)` while your bit reversal stage is just `O(n)` - unless `n` is very small it should not be a performance bottleneck. — Paul R, Nov 10 '16 at 16:44
@PaulR : The 'radix 2' takes a maximum of 189993 cycles to finish but the 'bit_reverse_reorder' takes 535293 so clearly it needs to be optimized — A.nechi, Nov 10 '16 at 21:36
Even in situations such as AES decompression where this sort of thing is used often, this step is probably not your biggest hit. If it is, take a look at various AES-128 decompressors out there (openSSL) and you'll have an example of how to do it in Intel. — Michael Dorgan, Nov 11 '16 at 00:08
Set `bits = i-1` and then drop the `-1` in `j+=pow_2[bits-k-1];` — lyst, Nov 11 '16 at 01:26
As this guy notes [http://stackoverflow.com/questions/36906/what-is-the-fastest-way-to-swap-values-in-c](http://stackoverflow.com/questions/36906/what-is-the-fastest-way-to-swap-values-in-c), instructions are faster than memory access, so you may consider dropping pow_2[] entirely and just performing a bit shift to get your power of 2 value. — lyst, Nov 11 '16 at 01:30
Yeah, what lyst said. If `pow_2[k] == 1U << k`, you should just write that instead. esp. for SIMD / auto-vectorization, a variable-shift is cheaper than a gather. Am I missing something, or this the only purpose or `pow_2`? — Peter Cordes, Nov 11 '16 at 03:57
@A.nechi: that means you're taking 130 clocks per element (or rather 260 clocks per pair of elements) to do bit reversed swapping - that's insanely high. Just use any of the standard bitwise tricks for bit reversal (no need for SIMD) and you should be able to get this down by an order of magnitude or two. — Paul R, Nov 11 '16 at 09:53
Have a look at [my answer](http://stackoverflow.com/a/40533543) to the question I linked to above. It doesn't use SIMD instructions, so I decided to add my answer to the more general question. SIMD probably won't help you anyway, except maybe to swap elements in one go if you interleave the `x` and `y` arrays. If `MAXPOW` is static or if you perform the FFT more than once, the best approach might be to precompute the indices to swap. — nwellnhof, Nov 12 '16 at 18:11

score 2 · Answer 1 · answered Nov 12 '16 at 22:24

2

Thanks to @nwellnhof comment i made another modification to the swapping function as following:

void bit_reverse_reorder(float *x,float *y, int N)
{
   unsigned i,j;
   for (i = 0, j = 0; i < N; i++) {
      if (i < j) 
      {
         float tmpx = x[i];
         x[i] = x[j];
         x[j] = tmpx;

         float tmpy = y[i];
         y[i] = y[j];
         y[j] = tmpy;
      }
      unsigned bit = ~i & (i + 1);

      unsigned rev = (N / 2) / bit;

      j ^= (N - 1) & ~(rev - 1);
   }
}

And now i get a performance of 54 900 cycles for the for loop inside the function which is also good :)

answered Nov 12 '16 at 22:24

A.nechi

521
1
5
15

`bit` is a power of 2, right? If the compiler doesn't figure that out, it will use an actual DIV instruction instead of just a right-shift. See http://stackoverflow.com/questions/40431599/efficiently-dividing-unsigned-value-by-a-power-of-two-rounding-up for some ideas about using TZCNT or a GNU C `__builtin_ctz` to implement an efficient log2 function that assumes power-of-2 inputs, to give you a shift count. (The main part of that question is about division rounding up, but you only need `unsigned rev = N >> (1+lg2(bit));` – Peter Cordes Nov 13 '16 at 00:15
Ok i will give it a try ... The issue will be used for intel and ARM architectures. So, i don't know if the `__builtin_ctz` will work for all architectures :) – A.nechi Nov 13 '16 at 15:13
It might not compile to anything wonderful on ARM, but it's definitely supported because it's not a platform-specific GNU extension. – Peter Cordes Nov 13 '16 at 18:11
1

I got curious and checked (https://godbolt.org/g/ewhq2i): ARM has a count leading zeros instruction, and a bit-reverse instruction. If you know (but the compiler doesn't) that you have a power of 2, prefer `32 - count leading zeros` instead of trailing zeros, so the compiler doesn't have to bit-reverse the register. – Peter Cordes Nov 13 '16 at 19:14

score 1 · Accepted Answer · answered Nov 11 '16 at 17:02

Basing on your suggestions I made a couple of modifications that enhanced the function performance.

First, I replaced the power_2[] calls by shifting instructions.

Then, I made a swap function that uses add/sub operations for swapping without the use of a third variable.

void swap(float* a, float* b)
{
 *a = *a+*b;
 *b = *a-*b;
 *a = *a-*b;
}

void bit_reverse_reorder(float *x,float *y, int N)
{
   int bits=0;
   int i, j, k;
   unsigned pow_bits=0;

   for (i=0; i<MAXPOW; i++)
      if (1 << i==N) bits=i-1;
      for (i=1; i<N; i++)
      {
          j=0;
          unsigned pow_2k=0,pow_2kj;

          for (k=0; k<bits; k++)
          {
             pow_2k = 1<<k;
             pow_2kj = 1<<(bits-k);
             if (i&pow_2k) j+=pow_2kj;
          }
          if (j>i)
          {
             swap(&x[i],&x[j]);
             swap(&y[i],&y[j]);
          }
       }
}

The number of cycles is reduced from almost 500 000 cycles to around 180 000 cycles.

You absolutely do not want to swap that way, esp. if you compile without `-ffast-math`. Writing it that way forces the compiler to actually generate those temporaries with correct rounding, instead of just swapping. Don't try to avoid temporaries; they're cheap. Compilers know how to use registers. — Peter Cordes, Nov 11 '16 at 21:19
Look at the asm output on godbolt, for example: https://godbolt.org/g/NR4Z7K. `swap()` compiles to a bunch of math, not just two loads and two stores. — Peter Cordes, Nov 11 '16 at 21:25

Bit reverse reorder optimization using Intel intrinsics

2 Answers2