A while ago, I optimized a radix2 function using SSE intrinsics and I'm nearly close to FFTW performance, So, my next step is to optimize the bit reverse reorder function that I found with the original code, to be honest, and I want to optimize it. The original code is like so:
void bit_reverse_reorder(float *x,float *y, int N)
{
int bits=0;
int i, j, k;
float tempr, tempi;
//MAXPOW = 12
for (i=0; i<MAXPOW; i++)
if (pow_2[i]==N) bits=i;
for (i=0; i<N; i++)
{
j=0;
for (k=0; k<bits; k++)
if (i&pow_2[k]) j+=pow_2[bits-k-1];
if (j>i)
{
tempr=x[i];
tempi=y[i];
x[i]=x[j];
y[i]=y[j];
x[j]=tempr;
y[j]=tempi;
}
}
}
int main()
{
radix2(re,im,N);
bit_reverse_reorder(re,im,N);
}
PS: pow_2[] is a precomputed array containing the powers of 2 (1,2,4,8,16,32,...), N is the number of element = 4096, *x and *y represents ,respectively, the real and imaginary parts of each element of the input data.
The radix2 generate results that are not ordered so the stated function reorder the result.
First of all, I did not fully understand how this bit reversal works! So, I think it would be good if someone gave me hints about how this function works.
Second, I intend to use SSE intrinsics to enhance this function performance so is there any vector of 2 instructions that could be used in the swapping loop?