I'm testing the performance of OpenMP, but I get some strange results. Here is my test code:

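#include <cstdio>
#include <ctime>
#include <omp.h>

void test()
{
    clock_t t1 = clock();
    int length = 50000;
    // Two zero-initialized arrays, private to each call of test()
    double *t3 = new double[length]();
    double *t4 = new double[length]();
    for (int i = 0; i < 8000; i++)
    {
        for (int j = 0; j < length; j++)
        {
            t3[j] = t3[j] + t4[j];
        }
    }
    clock_t t2 = clock();
    // clock_t is not guaranteed to be int, so cast before printing
    printf("Time = %ld  %d\n", (long)(t2 - t1), omp_get_thread_num());
    delete[] t3;
    delete[] t4;
}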

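int main()
{
    clock_t t1 = clock();
    printf("In parallel region:\n");
    // Run test() 8 times, distributed across the OpenMP threads
    #pragma omp parallel for
    for (int j = 0; j < 8; j++)
    {
        test();
    }

    clock_t t2 = clock();
    // clock_t is not guaranteed to be int, so cast before printing
    printf("Total time = %ld\n", (long)(t2 - t1));
    printf("In sequential region:\n");
    test();
    printf("\n");
    return 0;
}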

When I set length to 50000, 100000, and 150000 respectively, I get the results shown in the figure: [figure: elapsed times for the three lengths]

It is strange that:

  • the elapsed time does not grow linearly (the elapsed time for length=150000 is almost 5 times that for length=50000), while the amount of computation does grow linearly;
  • the elapsed time of the test function in the parallel region does not equal its elapsed time in the sequential region when length=150000.

My CPU is an Intel Core i5-4590 (4 cores) and my platform is VS2013 on Windows 8.

I hope somebody can tell me the reason and how to solve this problem to improve the performance of OpenMP. Thank you very much.

Hristo Iliev
Debo

1 Answer

There is nothing strange here. Your code is memory bound and the slowdown when going from length=50000 to longer arrays is due to the data no longer being able to fit into the CPU last-level cache.

  • length=50000: data size is 4 threads x 2 arrays x 50000 elements x 8 bytes per element = 3.05 MiB < L3 cache size (6 MiB for i5-4590)
  • length=100000: data size is 6.10 MiB > L3 cache size
  • length=150000: data size is 9.16 MiB > L3 cache size

In the second case, the data is just slightly larger than the CPU cache, therefore the time difference is only a bit bigger than 2x. In the third case, a substantial part of the array data (the working set exceeds the cache by about 3.2 MiB) cannot fit into the cache and must be streamed from and to main memory.

When the function is called from the main thread only, the memory used is 1/4 of what is used in the parallel region and the arrays fit entirely in the L3 cache for all three different lengths.

Check my answer to this question for more details.

Hristo Iliev