I have performance issue with a multithread implementation of a code compare to his "serial" analog. The serial code is about 0.02s faster than the parallel code.
Basically I am generating an image of a fractal. Each pixel is independent of the other and can be run in parallel. So what I do is to split the image so that each thread have about the same number of pixel to compute.
However, when I print CPU time to completion for each thread I have some strange behavior I don't understand:
- Some thread takes about ten times longer than other
- When I run the code several time, the same thread (computing the same part of the image) can be one time among the "slow thread" and at the next run among the "fast thread".
Here an exemple of the thread CPU time (computed with clock()) for 16 thread (I have tried a whole bench of thread number with always the same behavior) :
thread 0 duration : 0.012267 | nb iter : 378584 | nb of pixel computed : 405000
thread 1 duration : 0.015535 | nb iter : 583469 | nb of pixel computed : 405000
thread 15 duration : 0.008224 | nb iter : 1898601 | nb of pixel computed : 405000
thread 4 duration : 0.144990 | nb iter : 2467821 | nb of pixel computed : 405000
thread 13 duration : 0.072703 | nb iter : 2103618 | nb of pixel computed : 405000
thread 14 duration : 0.071292 | nb iter : 1901432 | nb of pixel computed : 405000
thread 11 duration : 0.111035 | nb iter : 3182682 | nb of pixel computed : 405000
thread 5 duration : 0.176053 | nb iter : 2970760 | nb of pixel computed : 405000
thread 2 duration : 0.232074 | nb iter : 984734 | nb of pixel computed : 405000
thread 10 duration : 0.174632 | nb iter : 3482202 | nb of pixel computed : 405000
thread 3 duration : 0.245894 | nb iter : 1502465 | nb of pixel computed : 405000
thread 12 duration : 0.173259 | nb iter : 2425202 | nb of pixel computed : 405000
thread 6 duration : 0.252091 | nb iter : 3680113 | nb of pixel computed : 405000
thread 8 duration : 0.238753 | nb iter : 4292033 | nb of pixel computed : 405000
thread 9 duration : 0.235892 | nb iter : 3993209 | nb of pixel computed : 405000
thread 7 duration : 0.263893 | nb iter : 4176359 | nb of pixel computed : 405000
Here I have 3 thread that take about 0.01s and other that take more than 0.2s. The number of pixel is perfectly equal for each thread, however number of iterations may vary depending on the area of the fractal. But some thread with very high number of iterations are faster than other with less iterations.
For instance thread 2 takes 0.23s for 900,000 iterations whereas thread 15 takes 0.008s for 1,900,000 iterations.
Why is that?
Could it be due to how I implemented the code or is it due to the OS / hardware of the machine and how it handles the threads?
I'm running this iMac with intel core i5 (so four CPU core) so I guess it can handle only 4 threads in parallel. I would understand if some threads take more time because they are queued and wait for CPU to be available but even so it can't explain the big difference I have. I would expect at max times 4 factor (16 thread / 4).
If it can help, here is the function where I create the thread :
int generate_imgstr(t_mlx *tmlx, t_fractal *fractal)
{
int i;
t_buffer *buffer;
buffer = init_buffer(NB_BUFF, tmlx, fractal);
i = -1;
while (++i < NB_BUFF)
{
if (pthread_create(&buffer[i].thread, NULL, mandelbrot, (void*)&buffer[i]))
{
perror("pthread_create");
return (EXIT_FAILURE);
}
}
i = -1;
while (++i < NB_BUFF)
{
if (pthread_join(buffer[i].thread, NULL))
{
perror("pthread_join");
return (EXIT_FAILURE);
}
}
free(buffer);
return (EXIT_SUCCESS);
}
And here the function run by each thread:
void *mandelbrot(void *buffer)
{
t_ipoint z;
t_ipoint c;
t_point2d p;
int j;
int max_iter;
double tmp;
int color;
int index;
t_buffer *buff;
buff = (t_buffer*)buffer;
max_iter = 50;
index = buff->start_pixel - 1;
while (++index < buff->start_pixel + (buff->size / 4))
{
p.x = index % buff->win_width;
p.y = index / buff->win_width;
z.real = 0;
z.imag = 0;
c.real = (double)p.x / (double)buff->fractal->zoom + buff->fractal->start.real;
c.imag = (double)p.y / (double)buff->fractal->zoom + buff->fractal->start.imag;
j = -1;
while ((++j < max_iter) && (z.real * z.real + z.imag * z.imag < 4.0))
{
tmp = z.real * z.real - z.imag * z.imag + c.real;
z.imag = 2. * z.real * z.imag + c.imag;
z.real = tmp;
}
color = get_color(j, buff->fractal->color_scale);
fill_string(buff, &p, color);
}
pthread_exit(NULL);
}