
I have a vector of 1024*4608 elements (Original_signal), which is stored in a one-dimensional array.

I enlarged Original_signal to Expand_signal by copying each block of 1024 elements 32 times, giving 1024*32*4608 elements.

Then I use a Com_array of 1024*32 elements to do an element-by-element multiplication with Expand_signal, and perform a 1024-point FFT on the resulting array.

The core code is as follows:

//initialize Original_signal
MKL_Complex8 *Original_signal = new MKL_Complex8[1024*4608];
for (int i=0; i<4608; i++)
{
  for (int j=0; j<1024; j++)
    {
      Original_signal[j+i*1024].real=rand();
      Original_signal[j+i*1024].imag=rand();
    }
 }
//Com_array
MKL_Complex8 *Com_array= new MKL_Complex8[32*1024];
for (int i=0; i<32; i++)
  {
    for (int j=0; j<1024; j++)
      {
        // note: "^" is bitwise XOR in C/C++, not exponentiation; use j*j for j squared
        Com_array[j+i*1024].real=cosf(2*pi*(i-16.0f)/10.0f*j*j);
        Com_array[j+i*1024].imag=sinf(2*pi*(i-16.0f)/10.0f*j*j);
      }
  }


//element-to-element multiplication
MKL_Complex8 *Temp_signal= new MKL_Complex8[1024*32];
MKL_Complex8 *Expand_signal= new MKL_Complex8[1024*32*4608];

gettimeofday(&Bgn_Time, 0);

for (int i=0; i<4608; i++)
  {
    for (int j=0; j<32; j++)
      {
        memcpy(Temp_signal+j*1024, Original_signal+i*1024, 1024*sizeof(MKL_Complex8));
      }
      // vmcMul requires a mode argument; VML_EP trades accuracy for speed
      vmcMul(1024*32, Temp_signal, Com_array, Expand_signal+i*1024*32, VML_EP);
  }

gettimeofday(&End_Time, 0);
double time_used = (double)(End_Time.tv_sec-Bgn_Time.tv_sec)*1000000+(double)(End_Time.tv_usec-Bgn_Time.tv_usec);
printf("element-by-element multiplication took %f us\n", time_used);


//FFT
DFTI_DESCRIPTOR_HANDLE h_FFT = 0;
DftiCreateDescriptor(&h_FFT, DFTI_SINGLE, DFTI_COMPLEX, 1, 1024);
DftiSetValue(h_FFT, DFTI_NUMBER_OF_TRANSFORMS, 32*4608);
DftiSetValue(h_FFT, DFTI_INPUT_DISTANCE, 1024);
DftiCommitDescriptor(h_FFT);


gettimeofday(&Bgn_Time, 0);

DftiComputeForward(h_FFT,Expand_signal);

gettimeofday(&End_Time, 0);
time_used = (double)(End_Time.tv_sec-Bgn_Time.tv_sec)*1000000+(double)(End_Time.tv_usec-Bgn_Time.tv_usec);
printf("FFT took %f us\n", time_used);

The element-by-element multiplication takes 700 ms (after removing the memcpy cost), and the FFT takes 500 ms.

An FFT requires roughly (N/2)*log2(N) complex multiplications, while the element-by-element multiplication requires N.

In this project N=1024, so in theory the FFT should be about 5 times slower than the element-by-element multiplication. Why is it faster in practice?

Is there any way to speed up the project?

(Note that Com_array is symmetric.)

Jie.Chen
  • Your timings probably include a significant amount of I/O. For the element multiplication you have 2N reads. For the FFT it's N reads. There is also less function call overhead in the FFT case. You may also want to check the CPU/core dispatch schedule to see whether the many FFTs are done in parallel, and whether that's also the case for vcMul. – SleuthEye Mar 19 '19 at 05:22
  • Also, in general, to say that one algorithm's time complexity is N and another is N log N is not to say that these numbers are comparable. In both cases there is a constant factor (C1 * N vs C2 * N * log(N)) which _will_ be different in every case. An insertion sort is O(N*N) while a quick sort is O(N log N) - however for short lists an insertion sort is usually faster because the (implied) constant is smaller. – Euan Smith Mar 19 '19 at 12:49
  • Welcome to Stackoverflow! Do you use a particular value for the [mode of multiplication](https://software.intel.com/en-us/mkl-developer-reference-c-vmlsetmode#DA4BFEA0-F845-4656-9AFA-3E33909F5834)? Some of these modes perform error checking or increase the accuracy; these features might induce slower computations. The flop count of a DFT of length N is about 5N log_2(N) (Cooley-Tukey algorithm; http://www.fftw.org/fftw-paper.pdf). For N=1024, that roughly corresponds to 50 flops for each value. It can be divided by 2 for real signals. It is indeed much larger than 1 multiplication! – francis Mar 19 '19 at 18:34
  • @francis I specifically used the mode VML_EP in vmcMul; it speeds up the multiplication and decreases the accuracy. But the time cost was still larger than the FFT. – Jie.Chen Mar 20 '19 at 02:09
  • @SleuthEye I bound this thread to only one CPU core, so I think they run in the same environment. – Jie.Chen Mar 20 '19 at 02:12
  • @EuanSmith Maybe you are right. Time complexity doesn't equal computational efficiency in code, so the actual time cost is much larger than I calculated in theory. I really need to speed up the program. I would be deeply indebted if you have any ideas. – Jie.Chen Mar 20 '19 at 02:18
  • Given the extra I/O in your vcMul path, I'd suggest to get rid of the unnecessary data duplication with `memcpy` and multiply different chunks of `com_array` with the same input. – SleuthEye Mar 20 '19 at 21:23
  • @francis: this is about complex multiplication, not about real multiplication. Complex multiplication is complicated! https://locklessinc.com/articles/complex_multiplication/ – Cris Luengo Mar 21 '19 at 03:34
  • @SleuthEye It isn't as easy as you think, e.g. (a+ib)(c+id)=ac-bd+i(bc+ad), but (a+ib)(c-id)=ac+bd+i(bc-ad). Both the real and imaginary parts are different, so it can't be done by a sign correction on the imaginary part. – Jie.Chen Mar 21 '19 at 11:11
  • My comment was initially based on your statement that `com_array` was symmetric. But you are right that it isn't that simple for a complex conjugate symmetry. The earlier statement about multiplying different chunks of `com_array` with `Original_signal` without the `memcpy` still applies though. – SleuthEye Mar 21 '19 at 13:17
  • @CrisLuengo: thanks for the very interesting link! Indeed, handling the NaN and Inf cases and keeping good accuracy in complex multiplication likely induces performance losses, since a function is called and many tests are performed. I wonder if there is a correct way to bypass these safeguards if special values are not a concern (does GCC's "-ffast-math" work with MKL functions?). A complex multiplication remains at least 6 flops, 6 times more than 1 real multiplication... Thanks Jie.Chen for having tested the VML_EP mode. – francis Mar 21 '19 at 17:58

1 Answer


In this project N=1024, so in theory the FFT should be about 5 times slower than the element-by-element multiplication. Why is it faster in practice?

As was indicated in the comments, the time complexity of the FFT gives you a relative measure for various FFT lengths, up to some constant factor. This factor becomes important when trying to compare with other computations. Also, your analysis assumes that performance is limited by floating-point operations, whereas in reality the actual performance appears limited by other factors such as special-case handling (e.g. NaN, Inf) and memory and cache access.

Any way to speed up the project?

Since your performance bottleneck is around the complex element-wise vector multiplication operation, the following will focus on improving the performance of that operation.

I do not have MKL to perform actual benchmarks, but it is probably fair to assume that the vmcMul implementation is both fairly robust to special cases such as NaN and Inf, and fairly optimized in the circumstances.

If you do not need robustness against the special cases, are running on an SSE3-capable processor, and can guarantee that your array sizes are multiples of 2 and 16-byte aligned, then you may get some performance gains by using a simplified implementation such as the following (based on Sebastien's answer to another post):

#include <pmmintrin.h>
#include <xmmintrin.h>

// Computes an element-by-element multiplication of complex vectors "a" and "b" and
// stores the result in "c".
// Vectors "a", "b" and "c" must be:
//   - vectors of even length N
//   - 16-bytes aligned
// Special cases such as NaN and Inf are not handled.
//
// based on https://stackoverflow.com/questions/3211346/complex-mul-and-div-using-sse-instructions#4884057
void packed_vec_mult(int N, MKL_Complex8* a, MKL_Complex8* b, MKL_Complex8* c)
{
  int M = N/2;

  __m128* aptr = reinterpret_cast<__m128*>(a);
  __m128* bptr = reinterpret_cast<__m128*>(b);
  __m128* cptr = reinterpret_cast<__m128*>(c);
  for (int i = 0; i < M; i++)
  {
    __m128 t0 = _mm_moveldup_ps(*aptr);
    __m128 t1 = *bptr;
    __m128 t2 = _mm_mul_ps(t0, t1);
    __m128 t3 = _mm_shuffle_ps(t1, t1, 0xb1);
    __m128 t4 = _mm_movehdup_ps(*aptr);
    __m128 t5 = _mm_mul_ps(t4, t3);
    *cptr = _mm_addsub_ps(t2, t5);

    ++aptr;
    ++bptr;
    ++cptr;
  }
}

Once the multiplication is optimized, your implementation could still be improved by getting rid of the extra copies to Temp_signal with memcpy, and instead directly multiplying Original_signal many times with different portions of Com_array, as shown below:

MKL_Complex8* outptr = Expand_signal;
for (int i=0; i<4608; i++)
{
  for (int j=0; j<32; j++)
  {
    packed_vec_mult(1024, Original_signal+i*1024, Com_array+j*1024, outptr);
    outptr += 1024;
  }
}

This last step would give you another ~20% performance improvement compared to your implementation with vmcMul replaced by packed_vec_mult.

Finally, since the loop performs operations on independent blocks, you may be able to get significantly higher throughput (though similar latency) by launching parallel computations over multiple threads, so that the CPU is kept busy instead of waiting for data in transit to/from memory. My tests suggest somewhere around a factor of 2 improvement, but results may differ depending on your specific machine.

SleuthEye