First of all, make sure you really need to do this, and actually want arrays of indices as an end result, not as part of keeping track of data in a triangular matrix. AVX2 has gather support, and AVX512 has scatter support, but introducing an array of indices makes SIMD much worse.
For looping over triangular matrices, and for i,j to linear, see algorithm of addressing a triangle matrices memory using assembly. (You might want to pad the indexing so each row starts at a 32-byte aligned boundary. i.e. round up the length of each row to a multiple of 8 float elements, a whole AVX vector. This also makes it easier to loop over a matrix with AVX vectors: you can store garbage into the padding at the end of a row instead of having the last vector of a row include some elements from the start of the next row.)
For linear -> i,j, the closed-form formula includes `sqrt` (there's also a C++ version), so it's possible that array lookups could be useful, but really you should avoid doing this at all. (e.g. if looping over a triangular matrix in packed format, keep track of where you are in i,j as well as the linear index so you don't need a lookup when you find an element you're looking for.)
If you do need this:
For large arrays, it breaks down into whole vectors pretty nicely, getting tricky only at the ends of rows.
You might use a pre-defined vector constant for the special case of the last corner, where you have multiple triangle rows within the same vector of 4 or 8 `int` elements.
`first = [0,0,0,1,1,2]`
With a larger triangle, we're storing large runs of the same number (like `memset`), then a slightly-shorter run of the next number, etc. i.e. storing a whole row of `0`s is easy. For all but the last couple rows, these runs are larger than 1 vector element.
`second = [1,2,3,2,3,3]`
Again, within a single row it's a simple pattern to vectorize. To store an increasing sequence, start with a vector of `{1,2,3,4}` and increment it with a SIMD add of `{4,4,4,4}`, i.e. `_mm_set1_epi32(4)`. For 256-bit AVX2 vectors, `_mm256_set1_epi32(8)` to increment an 8-element vector by 8.
So within the inner-most loop, you're just storing one invariant vector, and using `_mm256_add_epi32` on another and storing it to the other array.
Compilers can already auto-vectorize your function pretty decently, although the end-of-row handling is a lot more complex than what you'd write by hand. With your code on the Godbolt compiler explorer (with `__restrict` to tell the compiler the output arrays don't overlap, and `__builtin_assume_aligned` to tell the compiler that they're aligned), we get an inner loop like this (from gcc):
.L4: # do {
movups XMMWORD PTR [rcx+rax], xmm0 # _mm_storeu_si128(&second[index], vsecond)
paddd xmm0, xmm2 # _mm_add_epi32
movups XMMWORD PTR [r10+rax], xmm1 # _mm_storeu_si128(&first[index], vfirst)
add rax, 16 # index += 4 (16 bytes)
cmp rax, r9
jne .L4 # }while(index != end_row)
If I have time, I might write this up in more detail, including better handling of the ends of rows. e.g. partially-overlapping store that ends at the end of the row is often good. Or unroll the outer loop so the inner loops have a repeating pattern of cleanup.
Calculating the starting vectors for the next outer-loop iteration could be done with something like:
vsecond = _mm256_add_epi32(vsecond, _mm256_set1_epi32(1));
vfirst = _mm256_add_epi32(vfirst, _mm256_set1_epi32(1));
i.e. turn `{0,0,0,0,...}` into `{1,1,1,1,...}`, and `{1,2,3,4,5,6,7,8}` into `{2,3,4,5,6,7,8,9}`, by adding a vector of all-ones in each case.