OpenMP improve SpeedUp Block decomposition

Question

So I have this code parallelized with OpenMP, using block decomposition for a matrix. I am trying to fix the load imbalance with these:

#define lowerb(id, p, n) ( (id) * ((n)/(p)) + ((id) < ((n)%(p)) ? (id) : (n)%(p)) )
#define numElem(id, p, n) ( ((n)/(p)) + ((id) < ((n)%(p))) )
#define upperb(id, p, n) ( lowerb(id, p, n) + numElem(id, p, n) - 1 )
#define min(a, b) ( ((a) < (b)) ? (a) : (b) )
#define max(a, b) ( ((a) > (b)) ? (a) : (b) )
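
For readers who want to see what these macros compute, here is a minimal sketch of the same arithmetic written as functions. The names `block_lower`, `block_count` and `block_upper` are made up for this illustration and are not part of the original code; the first `n % p` blocks simply get one extra element each.

```c
#include <assert.h>

/* Function versions of the lowerb/numElem/upperb arithmetic.
   The function names are hypothetical, chosen only for this sketch. */

/* First index owned by block `id` when n items are split into p blocks. */
static int block_lower(int id, int p, int n)
{
    assert(p > 0 && id >= 0 && id < p);
    int rem = n % p;                       /* blocks that get one extra item */
    return id * (n / p) + (id < rem ? id : rem);
}

/* Number of items owned by block `id`. */
static int block_count(int id, int p, int n)
{
    assert(p > 0 && id >= 0 && id < p);
    return n / p + (id < n % p ? 1 : 0);
}

/* Last index (inclusive) owned by block `id`. */
static int block_upper(int id, int p, int n)
{
    return block_lower(id, p, n) + block_count(id, p, n) - 1;
}
```

With `n = 10` rows and `p = 4` threads, the blocks are rows 0..2, 3..5, 6..7 and 8..9, so no thread holds more than one row more than any other.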

Here is the code. Sorry I have to post a picture, but there is no way I can reach my laptop until next week.

There is not much to explain about the code: each thread gets a block of rows, so the work is shared between them. I use reduction(+:sum) to avoid the race condition.

All right, so here is the thing. I am only getting this speedup, and I am 100% sure it can get much better:

The thing is, I don't know what I should do or what the problem might be. Is it maybe some false sharing on the variable double *utmp? Or should I be aiming for a cyclic data decomposition instead?

Appreciate your help!

In all likelihood, your problem is memory bound, so once you've reached the full memory bandwidth of your machine, you can't go any faster. See [this answer](https://stackoverflow.com/a/11579987/5239503) to better understand why. And BTW, it looks to me like you've just reinvented `static` scheduling. Just parallelize your `i` loop naively; it will be easier for both you and the compiler to understand. — Gilles, Jan 10 '19 at 07:49
Your image of text [isn't very helpful](//meta.unix.stackexchange.com/q/4086). It can't be read aloud or copied into an editor, and it doesn't index very well, meaning that other users with the same problem are less likely to find the answer here. Please [edit] your post to incorporate the relevant text directly (preferably using copy+paste to avoid transcription errors). — Toby Speight, Jan 10 '19 at 17:49
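
To illustrate the suggestion in the first comment, here is a rough sketch of a naively parallelized `i` loop with `schedule(static)` and `reduction(+:sum)`. The original code was only posted as an image, so the function name `relax_rows`, the four-point averaging stencil and the row-major `n` by `n` layout are all assumptions made for this example; only `utmp` and the `sum` reduction come from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: an assumed stand-in for the code in the missing image.
   u and utmp are n-by-n row-major arrays; only interior points are updated. */
double relax_rows(const double *u, double *utmp, int n)
{
    assert(u != utmp);                 /* input and output must not alias */
    size_t stride = (size_t)n;
    double sum = 0.0;

    /* schedule(static) hands each thread one contiguous chunk of rows,
       which is what the hand-written block decomposition does. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            size_t k = (size_t)i * stride + (size_t)j;
            utmp[k] = 0.25 * (u[k - 1] + u[k + 1] + u[k - stride] + u[k + stride]);
            double diff = utmp[k] - u[k];
            sum += diff * diff;        /* per-thread partial sums, combined at the end */
        }
    }
    return sum;
}
```

Each thread writes a contiguous range of rows of `utmp`, so with this layout any false sharing would be confined to the cache lines at chunk boundaries. If the loop is memory bound, as the comment suspects, changing the decomposition is unlikely to help much.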
