
I would like to implement numpy.triu_indices(a, 1) (note that the second argument is 1) in C++ with AVX intrinsics. The code snippet below is the unvectorized version of the code I came up with. Here, a is the length (int), and first and second are the two output arrays:

void triu_indices(int a, int* first, int* second)
{
    int index = 0;
    for (int i=0; i < a; i++)
    {

        for(int j = i+1; j < a; j++)
        {
            first[index] = i;
            second[index] = j;
            index++;
        }

    }
}

As an example output, if I give a=4, then

first = [0,0,0,1,1,2]
second = [1,2,3,2,3,3]

Now, I would like to implement this fully in AVX2 (that is, in a vectorized way). Ultimately, the function will be run over an entire array of ints, which will supply the variable a to the function, and the output arrays first and second will be stored in two parent arrays.

Can you please give me some useful hints (or a code snippet) on how to vectorize this function with explicit AVX2 intrinsics (that is, not depending on compiler auto-vectorization)? Sorry if this is a noob question, as I have started learning AVX recently.

  • Have you studied the numpy code? – hpaulj May 25 '18 at 06:03
    Are you sure you actually want to create an array of indices instead of generating them on the fly when needed? You can generate an i,j pair given a linear index using [How to convert triangular matrix indexes in to row, column coordinates?](https://stackoverflow.com/q/40950460), or [algorithm for index numbers of triangular matrix coefficients](https://stackoverflow.com/q/242711), and more importantly you can *loop* over a triangular matrix without any expensive operations. For i,j to linear, see [algorithm of addressing a triangle matrices memory](https://stackoverflow.com/q/49165711). – Peter Cordes May 25 '18 at 06:08
  • AVX is predominantly floating point, so it’s unlikely to help for this. You may be able to use AVX2, which has integer support, however, assuming your CPU has it. Otherwise stick to SSE. – Paul R May 25 '18 at 06:10
  • @hpaulj Yes, apparently, it calls numpy.tri(), which itself calls numpy.greater_equal.outer(). At this point, I am stuck, as I couldn't find the reference to how it does it – Roy_123 May 25 '18 at 06:14
  • If auto-vectorizing with triangular matrices, you might want to pad the indexing so each row starts at a 32-byte aligned boundary. i.e. round up the length of each row to a multiple of 8 `float` elements, a whole AVX vector. This also makes it easier to loop over a matrix with AVX vectors: you can store garbage into the padding at the end of a row instead of having the last vector of a row include some elements from the start of the next row. – Peter Cordes May 25 '18 at 06:15
  • @PeterCordes I actually need the array of indices. But I will look into your references – Roy_123 May 25 '18 at 06:16
  • @PaulR yes, I can use avx2. I wanted to put that tag as well, but unfortunately, SO allows 5 tags only – Roy_123 May 25 '18 at 06:17
  • Without offsets, the `numpy` function is just: `np.where(np.arange(N)>=np.arange(N)[:,None])`. `np.tri` generates the boolean mask that's true in upper triangle, and `np.where` finds the coordinates of the True values. It's just combining a couple of whole array building blocks. – hpaulj May 25 '18 at 06:36
  • @hpaulj Thanks. That's a great clue. But won't that be less efficient than the current two-loop version? – Roy_123 May 25 '18 at 06:56

1 Answer


First of all, make sure you really need to do this, and actually want arrays of indices as an end result, not as part of keeping track of data in a triangular matrix. AVX2 has gather support, and AVX512 has scatter support, but introducing an array of indices makes SIMD much worse.

For looping over triangular matrices, and for i,j to linear, see algorithm of addressing a triangle matrices memory using assembly. (You might want to pad the indexing so each row starts at a 32-byte aligned boundary. i.e. round up the length of each row to a multiple of 8 float elements, a whole AVX vector. This also makes it easier to loop over a matrix with AVX vectors: you can store garbage into the padding at the end of a row instead of having the last vector of a row include some elements from the start of the next row.)

For linear -> i,j, the closed-form formula includes sqrt (also a C++ version), so it's possible that array lookups could be useful, but really you should avoid doing this at all. (e.g. if looping over a triangular matrix in packed format, keep track of where you are in i,j as well as linear so you don't need a lookup when you find an element you're looking for.)
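If you do end up needing linear -> i,j, here is a scalar sketch of that closed-form mapping. The helper name and the exact formula are my own derivation for the strictly-upper (k=1) case of this question, not code from the linked answers:

```cpp
#include <cmath>

// Hypothetical helper (my own derivation for the k=1 strictly-upper case):
// map a linear index idx back to (i, j) for an a x a matrix.
// Row i starts at linear offset i*(2a - i - 1)/2; inverting that with the
// quadratic formula gives i, and the remainder within the row gives j.
// The discriminant is a perfect square, (2a - 2i - 1)^2, at row boundaries,
// so double sqrt is exact for modest a; for very large a you would want an
// integer sqrt plus a fix-up step.
void linear_to_ij(int a, int idx, int* i_out, int* j_out)
{
    double b = 2.0 * a - 1.0;
    int i = (int)((b - std::sqrt(b * b - 8.0 * idx)) / 2.0);
    int row_start = i * (2 * a - i - 1) / 2;   // linear index of element (i, i+1)
    *i_out = i;
    *j_out = i + 1 + (idx - row_start);
}
```

For example, with a=4 and idx=3 this returns (i, j) = (1, 2), matching the fourth entry of the arrays in the question.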


If you do need this:

For large arrays, it breaks down into whole vectors pretty nicely, getting tricky only at the ends of rows.

You might use a pre-defined vector constant for the special case of the last corner where you have multiple triangle rows within the same vector of 4 or 8 int elements.

first = [0,0,0,1,1,2]

With a larger triangle, we're storing large runs of the same number (like memset), then a slightly-shorter run of the next number, etc. i.e. storing a whole row of 0s is easy. For all but the last couple rows, these runs are larger than 1 vector element.

second = [1,2,3,2,3,3]

Again, within a single row it's a simple pattern to vectorize. To store an increasing sequence, start with a vector of {1,2,3,4} and increment it with SIMD add with {4,4,4,4}, i.e. _mm_set1_epi32(4). For 256-bit AVX2 vectors, _mm256_set1_epi32(8) to increment an 8-element vector by 8.

So within the inner-most loop, you're just storing one invariant vector, and using _mm256_add_epi32 on another and storing it to the other array.

Compilers can already auto-vectorize your function pretty decently, although the end-of-row handling is a lot more complex than you could do by hand. With your code on the Godbolt compiler explorer (with __restrict to tell the compiler the output arrays don't overlap, and __builtin_assume_aligned to tell the compilers that they're aligned), we get an inner loop like this (from gcc):

.L4:                                       # do {
    movups  XMMWORD PTR [rcx+rax], xmm0      # _mm_storeu_si128(&second[index], counter_vec)
    paddd   xmm0, xmm2                       # _mm_add_epi32
    movups  XMMWORD PTR [r10+rax], xmm1      # _mm_storeu_si128(&first[index], invariant_vec)
    add     rax, 16                          # index += 4  (16 bytes)
    cmp     rax, r9
    jne     .L4                            # }while(index != end_row)

If I have time, I might write this up in more detail, including better handling of the ends of rows. e.g. partially-overlapping store that ends at the end of the row is often good. Or unroll the outer loop so the inner loops have a repeating pattern of cleanup.

Calculating the starting vectors for the next outer-loop iteration could be done with something like:

vsecond = _mm256_add_epi32(vsecond, _mm256_set1_epi32(1));
vfirst  = _mm256_add_epi32(vfirst, _mm256_set1_epi32(1));

i.e. turn {0,0,0,0,...} into {1,1,1,1,...} by adding a vector of all ones. And turn {1,2,3,4,5,6,7,8} into {2,3,4,5,6,7,8,9} by adding a vector of all ones.

Peter Cordes
  • Thanks a lot! Unfortunately, I do need to store the indices. I will edit the question a bit to explain what I intend to do with this. I will be looking forward to your more detailed explanation. – Roy_123 May 25 '18 at 07:46
  • @Roy_123: I know *you* need them; you said so in comments. The first part of this SO answer is for future readers that get here when searching on how to do SIMD with triangular matrices. The 2nd half of my answer already shows you how to vectorize most of the work efficiently. – Peter Cordes May 25 '18 at 07:53