Neighbor index computation for diagonally flattened matrix

Question

I have a 2D matrix stored in a flat buffer along diagonals. For example a 4x4 matrix would have its indexes scattered like so:

 0   2   5   9
 1   4   8  12
 3   7  11  14
 6  10  13  15

With this representation, what is the most efficient way to calculate the index of a neighboring element given the original index and a X/Y offset? For example:

// return the index of a neighbor given an offset
int getNGonalNeighbor(const size_t index,
                      const int x_offset,
                      const int y_offset){
    //...
}

// for the array above:
getNGonalNeighbor(15,-1,-1); // should return 11
getNGonalNeighbor(15, 0,-1); // should return 14
getNGonalNeighbor(15,-1, 0); // should return 13
getNGonalNeighbor(11,-2,-1); // should return 1

We assume here that overflow never occurs and there is no wrap-around.

I have a solution involving a lot of triangular number and triangular root calculations. It also contains a lot of branches, which I would prefer to replace with algebra if possible (this will run on GPUs where diverging control flow is expensive). My solution is working but very lengthy. I feel like there must be a much simpler and less compute intensive way of doing it.

Maybe it would help me if someone can put a name on this particular problem/representation.

I can post my full solution if anyone is interested, but as I said it is very long and relatively complicated for such a simple task. In a nutshell, my solution does:

translate the original index into a larger triangular matrix to avoid dealing with 2 triangles (for example 13 would become 17)

For the 4x4 matrix this would be:

0   2   5   9   14  20  27
1   4   8   13  19  26  
3   7   12  18  25  
6   11  17  24  
10  16  23
15  22  
21

calculate the index of the diagonal of the neighbor in this representation using the manhattan distance of the offset and the triangular root of the index.
calculate the position of the neighbor in this diagonal using the offset
translate back to the original representation by removing the padding.

For some reason this is the simplest solution i could come up with.

Edit:

having loop to accumulate the offset:

I realize that given the properties of the triangle numbers, it would be easier to split up the matrix in two triangles (let's call 0 to 9 'upper triangle' and 10 to 15 'lower triangle') and have a loop with a test inside to accumulate the offset by adding one while in the upper triangle and subtracting one in the lower (if that makes sense). But for my solution loops must be avoided at all cost, especially loops with unbalanced trip counts (again, very bad for GPUs).

So I am looking more for an algebraic solution rather than an algorithmic one.

Building a lookup table:

Again, because of the GPU, it is preferable to avoid building a lookup table and have random accesses in it (very expensive). An algebraic solution is preferable.

Properties of the matrix:

The size of the matrix is known.
For now I only consider square matrix, but a solution for rectangular ones as well would be nice.
as the name of the function in my example suggests, extending the solution to N-dimensional volumes (hence N-gonal flattening) would be a big plus too.

look at the patterns: 0 2 will always be the first elements in the first row, then for each (0,j) s.t. j is greater than 1, it increases +1( 2+3 =5, +4 =9, +5 = 14...) same for first column. after thinking a bit more, the pattern is given when looking at the column number and moving column-column. ill try to explain better: 0,1,3,6,10 increases x+1 each time. for column 1(starts with 2) you start with a +2,+3,+4, same for each column — Ariel Pinchover, Apr 17 '13 at 20:27
It seems you are describing the behavior of the upper triangle (for index less than 9 in this case). The problem is that this changes for the lower triangle, and gets messy if the index is in one triangle and the neighbor in the other. Which i why i wrap the whole matrix in a triangular matrix and use the properties you described. — Thibaut, Apr 17 '13 at 20:31
By upper and lower triangle i mean if you split the matrix into 2 triangles: from 0 to 9 and 10 to 15. What you are describing looks like triangle number sequence, which breaks when you go from one of these triangles to the other. If you look at 10 for example, the element above is 7, and it is on the 4th diagonal, but 10-4=6. — Thibaut, Apr 17 '13 at 20:36
Yes, it does. Why is where it gets messy. I think I know where you are going with this, because I had the same reasoning as you ;). I should have mentioned in my post that having a loop with a test in it is not an acceptable solution. — Thibaut, Apr 17 '13 at 20:42
so figure before the loop where the diagonal is and for each side of the diagonal do what you need; break it to the two sides of it. — Ariel Pinchover, Apr 17 '13 at 20:44
So your question is if theres a simpler way to do it, and if so what it is? — Ariel Pinchover, Apr 17 '13 at 20:54
Yes, i guess i am hoping for an obscure property of triangle numbers which would allow me to calculate how to move from one triangle to the other without calculating tons of temporaries. — Thibaut, Apr 17 '13 at 21:02
do you also know the size of the matrix? and doex it have to be a NxN matrix? — Ariel Pinchover, Apr 17 '13 at 21:05
Yes, good point. The size is known. For now i would be happy with a solution only for a square matrix, but a solution working for rectangular one as well would be nice ;) — Thibaut, Apr 17 '13 at 21:07
That was part of my question. The "diagonally flattened" name in the title is the best description i could think of, but there must be a proper name. — Thibaut, Apr 17 '13 at 21:12
im not a native, why did you come up with diagonally flattened? — Ariel Pinchover, Apr 17 '13 at 21:14
because it is one way to flatten the matrix. Flattening means you decrease the number of dimensions: here from 2 to one. There are different ways to do that: row major (left to right), column major (top to bottom), and here diagonal. It represents the order used to store the elements. Google row and column major order to know more, those are very common. — Thibaut, Apr 17 '13 at 21:17
http://www.mathematische-basteleien.de/triangularnumber.htm that http://stackoverflow.com/questions/4803180/mapping-elements-in-2d-upper-triangle-and-lower-triangle-to-linear-structure and http://stackoverflow.com/questions/242711/algorithm-for-index-numbers-of-triangular-matrix-coefficients — Ariel Pinchover, Apr 17 '13 at 21:19
Triangular numbers is the name of the observation you made for your very first comment. The sequence goes 1+2+3+...+n-1. Which is equal to n(n-1)/2. I used that all over the place in my solution. I used triangular root as well, which gives the diagonal index (or length) given the position. I am aware of those, but I need to combine them nicely to solve this problem. — Thibaut, Apr 17 '13 at 21:22
the solutions you reference are scatter/gather operations for this particular layout. It does not solve my particular problem. — Thibaut, Apr 17 '13 at 21:23
you said its for GPUs, arent they supposed to work with (relatively speaking) ease with calculations? — Ariel Pinchover, Apr 17 '13 at 21:24
not really. My problems are memory bound. I can afford very complex index computation as long as it optimizes memory accesses. This is why i am using this layout: it increases coalescing by a huge factor, but index computation becomes a nightmare. — Thibaut, Apr 17 '13 at 21:26
http://jwilson.coe.uga.edu/EMAT6680Fa06/Kitchings/CK6690/Figurate%20Numbers.html how about this one? — Ariel Pinchover, Apr 17 '13 at 21:39
thanks for doing some research :) This seems to be another explanation of triangular numbers and an introduction to ngonal numbers. I will read it in more detail but i don't think it will give us any hint to solve this problem. I think I already have the naive implementation using only triangular numbers and roots. — Thibaut, Apr 17 '13 at 21:43
i think if you divide the original squared matrix to 4 triangles meeting in the middle each will hold its own properties to calculate the neighbors — Ariel Pinchover, Apr 17 '13 at 21:46
I don't know how that would simplify the problem, but if you feel like giving it a try and code it let me know if it works ;) — Thibaut, Apr 17 '13 at 21:48
that http://docs.sympy.org/dev/_modules/sympy/matrices/matrices.html after doign some more research, triangular matrices are best for computer programming. — Ariel Pinchover, Apr 17 '13 at 22:01
Are your matrices of limited size? What do you think about using lookup tables? — zch, Apr 17 '13 at 22:10
I thought abt that too, but it seems cumbersome. The index offset depends on the offset value in each dimension and the actual input index. There is no good way of factorizing it enough for a lookup table in my opinion. The best we can do is a lookup table for each manatthan distance on each diagonal, which would be quite heavy and there is still some more computation to perform. To answer your question, the matrix has a fixed size, yes. And the offsets are not fixed but are within bounds of the matrix. Do you have anything specific in mind for the lookup table? — Thibaut, Apr 17 '13 at 22:16
@zch also, going N-dimensional would probably make the lookup table very complicated. — Thibaut, Apr 17 '13 at 22:17
@zch: oh, and also, this is for a memory bound GPU implementation, reading randomly into a lookup table cancels out all benefit of this layout. — Thibaut, Apr 17 '13 at 22:19
@Thibaut, I was just thinking about something simple like `NGONAL_ORD[X[index]+x_offset, Y[index]+y_offset]`. — zch, Apr 17 '13 at 22:48
@zch: Array subscript of array subscript is the black sheep of GPGPU programming ;) — Thibaut, Apr 17 '13 at 22:53

BLUEPIXY · Answer 1 · 2013-04-17T23:37:05.067

Table lookup

#include <stdio.h>

#define SIZE 16
#define SIDE  4  //sqrt(SIZE)

int table[SIZE];
int rtable[100];// {x,y| x<=99, y<=99 }

void setup(){
    int i, x, y, xy, index;//xy = x + y

    x=y=xy=0;
    for(i=0;i<SIZE;++i){
        table[i]= index= x*10 + y;
        rtable[x*10+y]=i;
        x = x + 1; y = y - 1;//right up
        if(y < 0 || x >= SIDE){
            ++xy;
            x = 0;
            y = xy;;
            while(y>=SIDE){
                ++x;
                --y;
            }
        }
    }
}
int getNGonalNeighbor(int index, int offsetX, int offsetY){
    int x,y;
    x=table[index] / 10 + offsetX;
    y=table[index] % 10 + offsetY;
    if(x < 0 || x >= SIDE || y < 0 || y >= SIDE) return -1; //ERROR
    return rtable[ x*10+y ];
}

int main() {
    int i;
    setup();
    printf("%d\n", getNGonalNeighbor(15,-1,-1));
    printf("%d\n", getNGonalNeighbor(15, 0,-1));
    printf("%d\n", getNGonalNeighbor(15,-1, 0));
    printf("%d\n", getNGonalNeighbor(11,-2,-1));
    printf("%d\n", getNGonalNeighbor(0, -1,-1));

    return 0;
}

don't use table version.

#include <stdio.h>

#define SIZE 16
#define SIDE  4

void num2xy(int index, int *offsetX, int *offsetY){
    int i, x, y, xy;//xy = x + y

    x=y=xy=0;
    for(i=0;i<SIZE;++i){
        if(i == index){
            *offsetX = x;
            *offsetY = y;
            return;
        }
        x = x + 1; y = y - 1;//right up
        if(y < 0 || x >= SIDE){
            ++xy;
            x = 0;
            y = xy;;
            while(y>=SIDE){
                ++x;
                --y;
            }
        }
    }
}
int xy2num(int offsetX, int offsetY){
    int i, x, y, xy, index;//xy = x + y

    x=y=xy=0;
    for(i=0;i<SIZE;++i){
        if(offsetX == x && offsetY == y) return i;
        x = x + 1; y = y - 1;//right up
        if(y < 0 || x >= SIDE){
            ++xy;
            x = 0;
            y = xy;;
            while(y>=SIDE){
                ++x;
                --y;
            }
        }
    }
    return -1;
}
int getNGonalNeighbor(int index, int offsetX, int offsetY){
    int x,y;

    num2xy(index, &x, &y);

    return xy2num(x + offsetX, y + offsetY);
}

int main() {
    printf("%d\n", getNGonalNeighbor(15,-1,-1));
    printf("%d\n", getNGonalNeighbor(15, 0,-1));
    printf("%d\n", getNGonalNeighbor(15,-1, 0));
    printf("%d\n", getNGonalNeighbor(11,-2,-1));
    printf("%d\n", getNGonalNeighbor(0, -1,-1));

    return 0;
}

@Thibaut That's right. Bucket size to represent the original coordinates rather than offset. — BLUEPIXY, Apr 17 '13 at 22:57
Thanks, this would work. However the payload of the lookup table seems to be as large as the original matrix. In effect, this behaves almost like transposing the matrix to another layout. This is definitely a step forward, but i am looking for a more algebraic solution. — Thibaut, Apr 17 '13 at 22:57
Also, to give you more context for the solution I am looking for: this is for GPGPU programming, where random accesses in memory are extremely costly, as opposed to 'aligned' (coalesced) ones. Which is why I am using this layout in the first place: with a traditional layout the memory accesses would be random for the problems I am solving, but not with the layout I am using here. However your lookup table re-introduces this random accesses via the lookup table. — Thibaut, Apr 17 '13 at 23:02
The second solution is interesting. It still looks a lot like layout transformation though. Your `num2xy` and `xy2num` seems to be scatter/gather operations in disguise. Which is interesting because they have algebraic equivalents which would get rid of all those loops and branching. I will definitely dig this deeper, thanks. — Thibaut, Apr 17 '13 at 23:56
I think num2xy and xy2num can, also change the shape of searching a base point (eg) 15. I do not know a little algebraic function. — BLUEPIXY, Apr 18 '13 at 00:09

Thibaut · Accepted Answer · 2013-04-18T01:04:35.330

I actually already had the elements to solve it somewhere else in my code. As BLUEPIXY's solution hinted, I am using scatter/gather operations, which I had already implemented for layout transformation.

This solution basically rebuilds the original (x,y) index of the given element in the matrix, applies the index offset and translates the result back to the transformed layout. It splits the square in 2 triangles and adjust the computation depending on which triangle it belongs to.

It is an almost entirely algebraic transformation: it uses no loop and no table lookup, has a small memory footprint and little branching. The code can probably be optimized further.

Here is the draft of the code:

#include <stdio.h>
#include <math.h>

// size of the matrix
#define SIZE 4

// triangle number of X
#define TRIG(X) (((X) * ((X) + 1)) >> 1)
// triangle root of X
#define TRIROOT(X) ((int)(sqrt(8*(X)+1)-1)>>1);

// return the index of a neighbor given an offset
int getNGonalNeighbor(const size_t index,
                      const int x_offset,
                      const int y_offset){
    // compute largest upper triangle index
    const size_t upper_triangle = TRIG(SIZE);

    // position of the actual element of index
    unsigned int x = 0,y = 0;

    // adjust the index depending of upper/lower triangle.
    const size_t adjusted_index = index < upper_triangle ?
                index :
                SIZE * SIZE - index - 1;

    // compute triangular root
    const size_t triroot = TRIROOT(adjusted_index);
    const size_t trig = TRIG(triroot);
    const size_t offset = adjusted_index - trig;

    // upper triangle
    if(index < upper_triangle){
        x = offset;
        y = triroot-offset;
    }
    // lower triangle
    else {
        x = SIZE - offset - 1;
        y = SIZE - (trig + triroot + 1 - adjusted_index);
    }

    // adjust the offset
    x += x_offset;
    y += y_offset;

    // manhattan distance
    const size_t man_dist = x+y;

    // calculate index using triangular number
    return TRIG(man_dist) +
            (man_dist >= SIZE ? x - (man_dist - SIZE + 1) : x) -
            (man_dist > SIZE ? 2* TRIG(man_dist - SIZE) : 0);
}

int main(){
    printf("%d\n", getNGonalNeighbor(15,-1,-1)); // should return 11
    printf("%d\n", getNGonalNeighbor(15, 0,-1)); // should return 14
    printf("%d\n", getNGonalNeighbor(15,-1, 0)); // should return 13
    printf("%d\n", getNGonalNeighbor(11,-2,-1)); // should return 1
}

And the output is indeed:

If you think this solution looks over complicated and inefficient, I remind you that the target here is GPU, where computation costs virtually nothing compared to memory accesses, and all index computations are computed at the same time using massively parallel architectures.

Yes, unless someone can propose an even simpler solution involving less computation. — Thibaut, Apr 18 '13 at 09:46

Neighbor index computation for diagonally flattened matrix

2 Answers2