I have code parallelized with OpenMP that uses a block decomposition of a matrix. I am trying to fix the load imbalance with these macros:

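/* First index, element count, and last index of block id when n items are split into p blocks. */
/* Every argument is parenthesized so that expressions such as id+1 expand correctly. */
#define lowerb(id, p, n) ( (id) * ((n)/(p)) + ((id) < ((n)%(p)) ? (id) : (n)%(p)) )
#define numElem(id, p, n) ( ((n)/(p)) + ((id) < ((n)%(p))) )
#define upperb(id, p, n) ( lowerb(id, p, n) + numElem(id, p, n) - 1 )
#define min(a, b) ( ((a) < (b)) ? (a) : (b) )
#define max(a, b) ( ((a) > (b)) ? (a) : (b) )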
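To make it easier to see what the macros above compute, here is a minimal sketch of the same arithmetic written as plain functions. The function names (`block_lower`, `block_size`, `block_upper`) are hypothetical and not part of my real code; they only mirror `lowerb`, `numElem` and `upperb`.

```c
#include <assert.h>

/* Number of items in block `id` when n items are split into p blocks.
   The first n % p blocks get one extra item. */
int block_size(int id, int p, int n)
{
    assert(p > 0);
    return n / p + (id < n % p);
}

/* Index of the first item in block `id`. */
int block_lower(int id, int p, int n)
{
    assert(p > 0);
    return id * (n / p) + (id < n % p ? id : n % p);
}

/* Index of the last item in block `id` (inclusive). */
int block_upper(int id, int p, int n)
{
    return block_lower(id, p, n) + block_size(id, p, n) - 1;
}
```

For example, with n = 10 rows and p = 4 blocks this gives the inclusive ranges 0..2, 3..5, 6..7 and 8..9, so block sizes differ by at most one row.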

Here is the code. Sorry I have to post a picture, but there is no way I can reach my laptop until next week.

There is not much to explain about the code: each thread gets a block of rows, so the work is shared among the threads. I use reduction(+:sum) to avoid the race condition on sum.

[Image: screenshot of the code]
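Since the code is only available as a screenshot, here is a rough sketch of the structure described above: one block of rows per thread and a `reduction(+:sum)`. This is not the code from the picture. The function name `relax_blocks`, the 4-point averaging kernel, and the parameter names are all assumptions made for illustration; only the block bounds and the reduction follow the description.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical kernel (assumption): average the four neighbours of each
   interior cell of u into utmp and return the sum of squared changes. */
double relax_blocks(const double *u, double *utmp, int rows, int cols)
{
    assert(rows >= 2 && cols >= 2);
    double sum = 0.0;
#ifdef _OPENMP
    int nblocks = omp_get_max_threads();
#else
    int nblocks = 1;                      /* serial fallback without OpenMP */
#endif
    #pragma omp parallel for reduction(+:sum)
    for (int b = 0; b < nblocks; ++b) {
        /* Same arithmetic as lowerb/upperb: block b owns rows first..last. */
        int q = rows / nblocks, r = rows % nblocks;
        int first = b * q + (b < r ? b : r);
        int last = first + q + (b < r) - 1;
        if (first < 1) first = 1;               /* skip the boundary rows */
        if (last > rows - 2) last = rows - 2;
        for (int i = first; i <= last; ++i) {
            for (int j = 1; j <= cols - 2; ++j) {
                double v = 0.25 * (u[i * cols + (j - 1)] + u[i * cols + (j + 1)] +
                                   u[(i - 1) * cols + j] + u[(i + 1) * cols + j]);
                double diff = v - u[i * cols + j];
                utmp[i * cols + j] = v;
                sum += diff * diff;             /* private copy per thread, combined at the end */
            }
        }
    }
    return sum;
}
```

Each thread writes a disjoint set of rows of `utmp`, and `sum` is accumulated in a private copy per thread, which is what the reduction clause provides.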

All right, here is the thing: I am only getting this speedup, and I am 100% sure it can get much better:

[Image: speedup plot]

The thing is, I don't know what I should do or what the problem might be. Is it maybe false sharing on the variable double *utmp? Or should I be aiming for a cyclic data decomposition?
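To make the two options concrete, here is a small sketch of which thread would own row `i` under each decomposition. The helper names `block_owner` and `cyclic_owner` are hypothetical. In OpenMP, a cyclic distribution of loop iterations is what `schedule(static, 1)` produces on a `parallel for`.

```c
#include <assert.h>

/* Cyclic decomposition: rows are dealt out round-robin over p threads. */
int cyclic_owner(int i, int p)
{
    assert(p > 0);
    return i % p;
}

/* Block decomposition matching lowerb/numElem: the first n % p blocks
   hold n / p + 1 rows, the remaining blocks hold n / p rows. */
int block_owner(int i, int p, int n)
{
    assert(p > 0 && i >= 0 && i < n);
    int q = n / p, r = n % p;
    int big = r * (q + 1);          /* rows covered by the r larger blocks */
    return i < big ? i / (q + 1) : r + (i - big) / q;
}
```

With the block decomposition, neighbouring rows almost always belong to the same thread, so two threads write next to each other in `utmp` only at block boundaries. With the cyclic one, every pair of neighbouring rows belongs to different threads.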

Appreciate your help!

xBurnsed
  • In all likelihood, your problem is memory bound, so once you've reached the full potential of your machine's memory bandwidth, you can't go any faster. See [this answer](https://stackoverflow.com/a/11579987/5239503) to better understand why. And BTW, it looks to me like you've just reinvented `static` scheduling. Just parallelize your `i` loop naively; it will be easier to understand for you and the compiler. – Gilles Jan 10 '19 at 07:49
  • Your image of text [isn't very helpful](//meta.unix.stackexchange.com/q/4086). It can't be read aloud or copied into an editor, and it doesn't index very well, meaning that other users with the same problem are less likely to find the answer here. Please [edit] your post to incorporate the relevant text directly (preferably using copy+paste to avoid transcription errors). – Toby Speight Jan 10 '19 at 17:49
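A minimal sketch of what the first comment suggests, that is, parallelizing the `i` loop directly and letting `schedule(static)` split the rows into contiguous chunks. The original code is only visible in the screenshot, so the function name `relax_static` and the 4-point averaging kernel are assumptions made for illustration, not the asker's code.

```c
#include <assert.h>

/* Hypothetical kernel (assumption): average the four neighbours of each
   interior cell of u into utmp and return the sum of squared changes. */
double relax_static(const double *u, double *utmp, int rows, int cols)
{
    assert(rows >= 2 && cols >= 2);
    double sum = 0.0;
    /* schedule(static) with no chunk size gives each thread one contiguous
       chunk of rows of roughly equal size, so no hand-written bounds are needed. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 1; i <= rows - 2; ++i) {
        for (int j = 1; j <= cols - 2; ++j) {
            double v = 0.25 * (u[i * cols + (j - 1)] + u[i * cols + (j + 1)] +
                               u[(i - 1) * cols + j] + u[(i + 1) * cols + j]);
            double diff = v - u[i * cols + j];
            utmp[i * cols + j] = v;
            sum += diff * diff;
        }
    }
    return sum;
}
```

Compiled without OpenMP support the pragma is ignored and the loop runs serially, which makes it easy to compare results. Whether this version is faster than the hand-written block decomposition would have to be measured; if the kernel is memory bound, as the comment argues, both should hit the same bandwidth limit.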
