I have a performance issue with a multithreaded implementation of some code compared to its "serial" analog: the serial code is about 0.02 s faster than the parallel code.

Basically I am generating an image of a fractal. Each pixel is independent of the others and can be computed in parallel, so I split the image such that each thread has about the same number of pixels to compute.

However, when I print the CPU time to completion for each thread, I see some strange behavior I don't understand:

  • Some threads take about ten times longer than others.
  • When I run the code several times, the same thread (computing the same part of the image) can be among the "slow" threads on one run and among the "fast" threads on the next.

Here is an example of the per-thread CPU time (computed with clock()) for 16 threads (I have tried a whole bunch of thread counts, always with the same behavior):

thread 0 duration : 0.012267 | nb iter : 378584 | nb of pixel computed : 405000
thread 1 duration : 0.015535 | nb iter : 583469 | nb of pixel computed : 405000
thread 15 duration : 0.008224 | nb iter : 1898601 | nb of pixel computed : 405000
thread 4 duration : 0.144990 | nb iter : 2467821 | nb of pixel computed : 405000
thread 13 duration : 0.072703 | nb iter : 2103618 | nb of pixel computed : 405000
thread 14 duration : 0.071292 | nb iter : 1901432 | nb of pixel computed : 405000
thread 11 duration : 0.111035 | nb iter : 3182682 | nb of pixel computed : 405000
thread 5 duration : 0.176053 | nb iter : 2970760 | nb of pixel computed : 405000
thread 2 duration : 0.232074 | nb iter : 984734 | nb of pixel computed : 405000
thread 10 duration : 0.174632 | nb iter : 3482202 | nb of pixel computed : 405000
thread 3 duration : 0.245894 | nb iter : 1502465 | nb of pixel computed : 405000
thread 12 duration : 0.173259 | nb iter : 2425202 | nb of pixel computed : 405000
thread 6 duration : 0.252091 | nb iter : 3680113 | nb of pixel computed : 405000
thread 8 duration : 0.238753 | nb iter : 4292033 | nb of pixel computed : 405000
thread 9 duration : 0.235892 | nb iter : 3993209 | nb of pixel computed : 405000
thread 7 duration : 0.263893 | nb iter : 4176359 | nb of pixel computed : 405000

Here I have 3 threads that take about 0.01 s and others that take more than 0.2 s. The number of pixels is exactly equal for each thread; the number of iterations, however, varies depending on the area of the fractal. But some threads with a very high number of iterations are faster than others with fewer iterations.

For instance, thread 2 takes 0.23 s for ~980,000 iterations whereas thread 15 takes 0.008 s for ~1,900,000 iterations.

Why is that?

Could it be due to how I implemented the code, or is it due to the OS/hardware of the machine and how it schedules the threads?

I'm running this on an iMac with an Intel Core i5 (so four CPU cores), so I guess it can handle only 4 threads truly in parallel. I would understand some threads taking more time because they are queued, waiting for a CPU to become available, but even so that can't explain the big difference I see: I would expect at most a factor of 4 (16 threads / 4 cores).

If it can help, here is the function where I create the threads:

int	generate_imgstr(t_mlx *tmlx, t_fractal *fractal)
{
	int			i;
	t_buffer	*buffer;

	buffer = init_buffer(NB_BUFF, tmlx, fractal);
	i = -1;
	while (++i < NB_BUFF)
	{
		if (pthread_create(&buffer[i].thread, NULL, mandelbrot, (void *)&buffer[i]))
		{
			perror("pthread_create");
			return (EXIT_FAILURE);
		}
	}
	i = -1;
	while (++i < NB_BUFF)
	{
		if (pthread_join(buffer[i].thread, NULL))
		{
			perror("pthread_join");
			return (EXIT_FAILURE);
		}
	}
	free(buffer);
	return (EXIT_SUCCESS);
}

And here the function run by each thread:

void	*mandelbrot(void *buffer)
{
	t_ipoint	z;
	t_ipoint	c;
	t_point2d	p;
	int			j;
	int			max_iter;
	double		tmp;
	int			color;
	int			index;
	t_buffer	*buff;

	buff = (t_buffer *)buffer;
	max_iter = 50;
	index = buff->start_pixel - 1;
	while (++index < buff->start_pixel + (buff->size / 4))
	{
		/* map the linear pixel index to (x, y), then to the complex plane */
		p.x = index % buff->win_width;
		p.y = index / buff->win_width;
		z.real = 0;
		z.imag = 0;
		c.real = (double)p.x / (double)buff->fractal->zoom + buff->fractal->start.real;
		c.imag = (double)p.y / (double)buff->fractal->zoom + buff->fractal->start.imag;
		/* escape-time loop: iterate z = z^2 + c until |z| >= 2 or max_iter */
		j = -1;
		while ((++j < max_iter) && (z.real * z.real + z.imag * z.imag < 4.0))
		{
			tmp = z.real * z.real - z.imag * z.imag + c.real;
			z.imag = 2. * z.real * z.imag + c.imag;
			z.real = tmp;
		}
		color = get_color(j, buff->fractal->color_scale);
		fill_string(buff, &p, color);
	}
	pthread_exit(NULL);
}
  • Possible duplicate of [clock function in C++ with threads](https://stackoverflow.com/questions/35592502/clock-function-in-c-with-threads) and also [Single-threaded and multi-threaded code taking the same time](https://stackoverflow.com/questions/10316266/single-threaded-and-multi-threaded-code-taking-the-same-time) and a zillion others. – n. m. could be an AI May 28 '18 at 16:29
  • Following your remark about the clock function, I tried to time the execution with the shell's time builtin (zsh here). Here is the output: 1.26s user 0.12s system 83% cpu 1.665 total for serial & 1.31s user 0.14s system 149% cpu 0.969 total for parallel. I'm not sure how to interpret this result... – cylon86 May 28 '18 at 17:12
  • Amdahl's law - speedup is always limited by the part of the task that cannot benefit from the improvement – mrflash818 May 28 '18 at 17:31
  • 0.969 is your **wall** time, 1.31+0.14=1.45 is **cpu** time, 1.45/0.969=1.49 = 149%, which means you have about 1.5 processors working at the same time on the average. – n. m. could be an AI May 28 '18 at 17:31
  • The threads write into very different areas of memory, and some of them likely to suffer the cache misses. The single-threaded version seems much more cache friendly. – user58697 May 28 '18 at 21:26
  • @n.m. thanks for the explanation, quite interesting. So do I get it right that maybe here it's just that the code is not "heavy" enough to make a difference with multi threading? – cylon86 May 28 '18 at 21:50
  • It does run about 1.3 times faster than the single-threaded version, but this is far from being ~N times faster (N being the number of cores). Your next step should be profiling. – n. m. could be an AI May 29 '18 at 14:36
  • Yes, it is indeed running faster. I understand better now all the time measurement (wall time, user, sys). When running the fractal and zooming in, the difference become obvious between serial and parallel. The CPU is loaded at ~90% in multithread vs 25% serial. I'm not sure to know how to start working on profiling, but I will look into that. If you have any starting point for me to read about this, it will be much appreciated. Many thanks for your help !! – cylon86 May 30 '18 at 15:06
