
A common technique in parallelization is to fuse nested for loops like this:

for(int i=0; i<n; i++) {
    for(int j=0; j<n; j++) {

to this:

for(int x=0; x<n*n; x++) {
    int i = x/n; int j = x%n;

I'm wondering how I can do the same thing to fuse a triangular loop like this:

for(int i=0; i<n; i++) {
   for(int j=0; j<i+1; j++) {

This has n*(n+1)/2 iterations. Let's call the fused iteration x. For a given x, i is the largest integer with i*(i+1)/2 <= x; solving that with the quadratic formula, I have come up with this:

for(int x=0; x<(n*(n+1)/2); x++) {      
    int i  = (-1 + sqrt(1.0+8.0*x))/2;
    int j = x - i*(i+1)/2;

Unlike fusing the square loop, this requires the sqrt function as well as conversions from int to float and back from float to int.

I'm wondering if there is a simpler or more efficient way of doing this. For example, a solution which does not require the sqrt function or the conversions between int and float.

Edit: I don't want a solution which depends on previous or next iterations. I only want solutions of the form int i = funci(x) and int j = funcj(x,i).

Here is some code showing that this works:

#include <stdio.h>
#include <math.h>
int main() {
    int n = 5;
    int cnt = 0;
    // reference: the original triangular loop
    for(int i=0; i<n; i++) {
        for(int j=0; j<i+1; j++) {
            printf("%d: %d %d\n", cnt++, i,j);
        }
    }
    printf("\n");

    // fused: recover i and j from the single index x
    int nmax = n*(n+1)/2;
    for(int x=0; x<nmax; x++) {
        int i = (-1 + sqrt(1.0+8.0*x))/2;
        int j = x - i*(i+1)/2;
        printf("%d: %d %d\n", x,i,j);
    }
}
Z boson
  • Why? If it's for performance, a call to `sqrt()` in the inner-most loop seems like a very negative trade-off. – unwind Jun 03 '14 at 11:23
  • @unwind, it could be used for fusing a parallel for loop. Anyway, fusing a square loop requires a division (i=x/n, j = x%n) which is not that much slower than the sqrt instruction on modern CPUs. But that's sorta the point of the question. Can I do this without the `sqrt` function? – Z boson Jun 03 '14 at 11:26
  • The `sqrt` isn't the only expensive function, there are conversions to and from doubles involved. – MSalters Jun 03 '14 at 11:29
  • Note that the use matters; the first two answers have a sequential update of i and j which doesn't parallelize. – MSalters Jun 03 '14 at 11:31
  • @MSalters, yes you're correct that there are to and from double conversions. I did not think of that. But I think I could eliminate the conversion to double by making x a double then I would only have to convert from double to int. – Z boson Jun 03 '14 at 11:47
  • That breaks parallelization outright. For large enough doubles, x==x+1.0 – MSalters Jun 03 '14 at 11:49
  • If you are only looking for speed optimization, and if n is reasonably small, could you compute all the iteration tuples, fit them in a look-up table and iterate over this array of tuples? This would of course increase the size of your code (thus why n needs to be small). – Emilien Jun 03 '14 at 12:19
  • @Emilien: It's fairly obvious that `n` is small; for larger `n` you'd just parallelize the outer loop. (Although you'd again use the trick I outlined in my second answer, to combine the `i`th and `n-i`th outer loops since that makes the inner loop constant length.) – MSalters Jun 03 '14 at 12:39
  • Very related: http://stackoverflow.com/questions/17406593/triangular-array/17406653#17406653 Just pick an indexing method for your triangle cells and use that in your parallelized for. – Timothy Shields Jun 03 '14 at 15:51
  • @TimothyShields, thanks for the interesting link. But that's going from i,j to k (2D to 1D) I want to map k to i,j (1D to 2D). – Z boson Jun 03 '14 at 16:01

3 Answers


Considering that you're trying to fuse a triangle with the intent of parallelizing, the non-obvious solution is to choose a non-trivial mapping of x to (i,j):

   i ->
   |\
j  | \            ____
|  |  \    =>    |\\   |
V  |___\         |_\\__|

After all, you're not processing them in any special order, so the exact mapping is a don't care.

So calculate x->i,j as you'd do for a rectangle, but if i > j then { i=N-i, j = N-j } (mirror Y axis, then mirror X axis).

   ____
 |\\   |      |\           |\
 |_\\__|  ==> |_\  __  =>  | \
                  / |      |  \
                 /__|      |___\
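
One minimal sketch of this idea, assuming the triangle from the question (0 <= j <= i < n): treat x as an index into a rectangle that is n+1 columns wide and reflect any point that falls above the diagonal back into the triangle. The exact reflection used below is just one way to do the fold and may differ in detail from the mapping sketched above, so check it against your own loop bounds:

#include <stdio.h>

int main() {
    // a sketch of one possible fold of the rectangle onto the triangle,
    // not necessarily the exact mapping described above
    int n = 5;
    int nmax = n*(n+1)/2;           // number of cells in the triangle
    for(int x=0; x<nmax; x++) {
        int i = x/(n+1);            // row in an (n+1)-wide rectangle
        int j = x%(n+1);            // column in that rectangle
        if(j > i) {                 // above the diagonal: reflect into the triangle
            i = n - 1 - i;
            j = n - j;
        }
        printf("%d: %d %d\n", x, i, j);
    }
}

The iterations come out in a different order than the nested loops, but as noted above that is a don't care: every x maps to a distinct (i,j) inside the triangle, and there is no sqrt and no int/float conversion.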
MSalters
  • I think there is a typo: `but if i > N/2` should read `i > j`, right? – Massimiliano Jun 03 '14 at 12:36
  • Thank you, that's the kind of answer I was looking for. That's a clever solution. OpenMP has a method to fuse nested loops. I don't know if it can fuse triangle loops but I was mostly wondering how someone would go about doing it manually since I have done it several times for square loops but never for a triangular loop. – Z boson Jun 03 '14 at 14:43
  • I finally worked through this and used your solution in an [answer](http://stackoverflow.com/a/33836073/2542702) (see towards the end of the answer). The triangular loop was not exactly the same as in my question. My question uses the lower-left triangle including the diagonal, whereas in the other question the triangle is upper-right and excludes the diagonal. The mapping was not exactly as you say but close. It became `if(j<=i) { i = n - i - 2; j = n - j - 1; }`. How did you come up with such a clever answer? I'm really impressed! I'm surprised it does not have more votes. – Z boson Nov 23 '15 at 20:54
  • @Zboson: The geometrical transformation from triangle to rectangle is fairly well-known as a convenient way to prove `A = h*w/2` for a triangle. Programming is math, really. – MSalters Nov 23 '15 at 23:47
  • +1 for the illustration. I recently came up with a similar solution for a segmented (much larger) triangle. My own solution was to use an iteration counter that tells each thread how much to skip forward, i.e. j = (i + iter) % blockSize; – ofer.sheffer Apr 12 '17 at 16:16
  • You do need to be a little careful when j is odd, otherwise you end up double processing a few entries. – Michael Anderson Jul 03 '17 at 07:46

The most sane form is of course the first form.

That said, the fused form is better done with conditionals:

int i = 0; int j = 0;
for(int x=0; x<(n*(n+1)/2); x++) {
  // ... use (i, j) here ...

  // advance to the next cell of the triangle
  ++j;
  if (j>i)
  {
    j = 0;
    ++i;
  }
}
MSalters
  • That won't work (easily) in a parallel for loop. I don't want solutions that depend on previous or next iterations. They must be independent. I'm sorry for not stating that in my question. I updated my question. – Z boson Jun 03 '14 at 11:31

I'm wondering if there is a simpler or more efficient way of doing this?

Yes, the code you had to begin with. Please keep the following in mind:

  • There exists no case where floating point arithmetic is ever faster than plain integers.
  • There does however exist plenty of cases where floating point is far slower than plain integers. FPU or no FPU.
  • Float variables are generally larger than plain integers on most systems and therefore slower for that reason alone.
  • The first version of the code is likely most friendly to the cache memory. As for any case of manual optimization, this depends entirely on what CPU you are using.
  • Division is generally slow on most systems, no matter if done to plain integers or floats.
  • Any form of complex arithmetic is going to be slower than simple counting.

So your second example is pretty much guaranteed to be far slower than the first example, for any given CPU in the world. In addition, it is also completely unreadable.

Lundin
  • Are you arguing that fusing square loops is not helpful in general in parallelization? – Z boson Jun 03 '14 at 11:53
  • @Zboson I'm rather arguing that complex code will never be faster than simple code. And _it doesn't make any sense whatsoever_ to attempt manual optimization before you have benchmarked your program and found a bottleneck, and you have a specific system and CPU in mind, and you have very in-depth knowledge of this system/CPU. Though it turns out that what you have written will always be slow on any CPU I've ever heard of, from ancient 8-bit MCUs from the 70s to modern 64-bit monsters. – Lundin Jun 03 '14 at 12:03
  • I understand all your points. I probably should not have used the words "more efficient". Typically the calculation/load performed in the loop will take much more time than the time to compute the iterators, so it's not so important how efficient the fusing is. But fusing can be useful for load distribution. I am mostly interested in solutions for fusing the loop different from the one I came up with, which as you said is "completely unreadable". And if the solution is more efficient, all the better. – Z boson Jun 03 '14 at 21:28