I have code parallelized with OpenMP that uses a block decomposition of a matrix. I am trying to balance the load across threads with these macros:
#define lowerb(id, p, n) ( (id) * ((n)/(p)) + ((id) < ((n)%(p)) ? (id) : ((n)%(p))) )
#define numElem(id, p, n) ( ((n)/(p)) + ((id) < ((n)%(p))) )
#define upperb(id, p, n) ( lowerb(id, p, n) + numElem(id, p, n) - 1 )
#define min(a, b) ( ((a) < (b)) ? (a) : (b) )
#define max(a, b) ( ((a) > (b)) ? (a) : (b) )
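As a sanity check on what these macros are meant to do, here is a minimal sketch that verifies the blocks cover rows 0..n-1 with no gap or overlap and differ in size by at most one row. The helper name check_partition is hypothetical and not part of my actual code; the macro arguments are parenthesized here so that expression arguments expand safely.

```c
/* Block-decomposition macros with every argument parenthesized. */
#define lowerb(id, p, n)  ( (id) * ((n)/(p)) + ((id) < ((n)%(p)) ? (id) : ((n)%(p))) )
#define numElem(id, p, n) ( ((n)/(p)) + ((id) < ((n)%(p))) )
#define upperb(id, p, n)  ( lowerb(id, p, n) + numElem(id, p, n) - 1 )

/* Hypothetical helper: returns 1 if the p blocks cover rows 0..n-1
   exactly once and the block sizes differ by at most one row. */
static int check_partition(int p, int n)
{
    int next = 0;                         /* first row not yet covered */
    int smallest = numElem(0, p, n);
    int largest  = smallest;

    for (int id = 0; id < p; id++) {
        int cnt = numElem(id, p, n);
        if (lowerb(id, p, n) != next)     /* gap or overlap between blocks */
            return 0;
        if (cnt < smallest) smallest = cnt;
        if (cnt > largest)  largest  = cnt;
        next += cnt;
    }
    return next == n && (largest - smallest) <= 1;
}
```

For example, with n = 10 rows and p = 4 threads the blocks are rows 0-2, 3-5, 6-7 and 8-9, so the row counts themselves are balanced.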
Here is the code. Sorry I have to post a picture, but there is no way I can reach my laptop until next week.
There is not much to explain about the code: each thread gets a contiguous block of rows, so the work is shared among the threads. I use reduction(+:sum)
to avoid the race condition on sum.
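In case the picture is hard to read, the overall structure is roughly like the sketch below. This is a simplified stand-in, not my actual kernel: the function name block_sum, the row-major array u, and the loop body (a plain sum of the elements) are assumptions made only to show how the macros and the reduction fit together.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define lowerb(id, p, n)  ( (id) * ((n)/(p)) + ((id) < ((n)%(p)) ? (id) : ((n)%(p))) )
#define numElem(id, p, n) ( ((n)/(p)) + ((id) < ((n)%(p))) )
#define upperb(id, p, n)  ( lowerb(id, p, n) + numElem(id, p, n) - 1 )

/* Hypothetical stand-in for the real kernel: sums a row-major
   rows x cols matrix, one contiguous block of rows per thread. */
static double block_sum(const double *u, int rows, int cols)
{
    double sum = 0.0;

    #pragma omp parallel reduction(+:sum)
    {
        int id = omp_get_thread_num();
        int p  = omp_get_num_threads();

        /* Rows owned by this thread; empty if there are more threads than rows. */
        int first = lowerb(id, p, rows);
        int last  = upperb(id, p, rows);

        for (int i = first; i <= last; i++)
            for (int j = 0; j < cols; j++)
                sum += u[i * cols + j];   /* each thread adds to its private copy */
    }
    return sum;   /* private copies have been combined at the end of the region */
}
```

Each thread accumulates into its own private copy of sum, and OpenMP combines the copies when the parallel region ends, which is why no atomic or critical section is needed.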
All right, so here is the problem. I am only getting this speedup, and I am 100% sure it can be much better:
I don't know what I should do or what the problem might be. Could it be false sharing on the array pointed to by double *utmp?
Or should I be aiming for a cyclic data decomposition instead?
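To be clear about what I mean by a cyclic decomposition, here is a minimal sketch in which thread id takes rows id, id + p, id + 2p, and so on, instead of one contiguous block. As before, cyclic_sum and the summing loop body are hypothetical placeholders for my real kernel, and I have not measured whether this variant is any faster.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical stand-in kernel with a cyclic (round-robin) row
   assignment: thread id handles rows id, id + p, id + 2p, ... */
static double cyclic_sum(const double *u, int rows, int cols)
{
    double sum = 0.0;

    #pragma omp parallel reduction(+:sum)
    {
        int id = omp_get_thread_num();
        int p  = omp_get_num_threads();

        for (int i = id; i < rows; i += p)      /* stride of p rows */
            for (int j = 0; j < cols; j++)
                sum += u[i * cols + j];
    }
    return sum;
}
```

As far as I understand, the same row assignment can also be requested with "#pragma omp for schedule(static, 1)" on the outer loop, without computing the indices by hand.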
Appreciate your help!