
I'm trying to leverage the new AVX2 GATHER instructions to speed up a sparse matrix–vector multiplication. The matrix is in CSR (or Yale) format, with a row pointer array that indexes into a column index array, which in turn holds the column of each nonzero value. The C code for such a mat-vec multiply looks like this:

for (int row = 0; row < n_rows - 1; row++) {
    double rowsum = 0;
    for (int col = row_ptr[row]; col < row_ptr[row + 1]; col++) {
        rowsum += values[col] * x[col_indices[col]];
    }
    result[row] = rowsum;
}
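
For concreteness, here is a small illustrative example of these three arrays (made up for this post, not my actual data):

/* A 3x4 sparse matrix and its CSR representation:
 *
 *     [ 5 0 0 2 ]
 * A = [ 0 3 0 0 ]
 *     [ 0 0 7 1 ]
 */
double values[]      = { 5.0, 2.0, 3.0, 7.0, 1.0 };  // nonzeros, row by row
int    col_indices[] = { 0,   3,   1,   2,   3   };  // column of each nonzero
int    row_ptr[]     = { 0, 2, 3, 5 };               // row i spans [row_ptr[i], row_ptr[i+1])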

Now my goal is to accelerate this with AVX2 intrinsics. The following code compiles with the latest Intel or GCC compilers and is based on https://blog.fox-toolkit.org/?p=174 . I removed the remainder handling here because all my rows happen to be a multiple of 4 doubles long (columns % 4 == 0), lucky me. (I have the code dealing with the remainder too if someone is interested; a simplified sketch follows the listing.) The point is: the code is actually slightly slower than the scalar version. I checked the disassembly; for the version above only scalar FP instructions are generated, and for my AVX2 code all the AVX2 ops appear as expected. Even with small matrices that fit into the cache, the AVX2 version is no better. I'm puzzled here...

double* value_base = &values[0];
double* x_base = &x[0];
int*    index_base = &col_indices[0];


for (int row = 0; row < n_rows - 1; row++) {
    int col_length   = row_ptr[row + 1] - row_ptr[row];

    __m256d rowsum = _mm256_set1_pd(0.);
    for (int col4 = 0; col4 < col_length; col4 += 4) {
        // Load 4 32-bit column indices for gathering from x
        __m128i idxreg     = _mm_load_si128((const __m128i*)index_base);
        // Load 4 doubles from x indexed by idxreg (AVX2)
        __m256d x_     = _mm256_i32gather_pd(x_base, idxreg, 8);
        // Load 4 doubles linear from memory (value array)
        __m256d v_     = _mm256_load_pd(value_base);
        // FMA: rowsum += x_ * v_
        rowsum = _mm256_fmadd_pd(x_, v_, rowsum);

        index_base += 4;
        value_base += 4;
    }
    __m256d s = _mm256_hadd_pd(rowsum, rowsum);
    result[row] = ((double*)&s)[0] + ((double*)&s)[2];
    // Alternative (not faster):
    // Now we split the upper and lower AVX register, and do a number of horizontal adds
    //__m256d hsum = _mm256_add_pd(rowsum, _mm256_permute2f128_pd(rowsum, rowsum, 0x1));
    //_mm_store_sd(&result[row], _mm_hadd_pd( _mm256_castpd256_pd128(hsum), _mm256_castpd256_pd128(hsum) ) );
}
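
For the curious, a minimal sketch of one way to handle the remainder (simplified, not necessarily my actual remainder code): cap the vector loop condition at col4 + 4 <= col_length, then finish the row with a scalar tail:

// value_base / index_base were advanced past the vectorized elements
// inside the loop, so the scalar tail just continues from there.
double tail = 0.;
for (int col = col_length & ~3; col < col_length; col++) {
    tail += *value_base++ * x_base[*index_base++];
}
result[row] += tail;  // result[row] already holds the vectorized part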

Any suggestions are welcome.

Thanks a lot, Chris

chhu79
  • Basically, gather sucks. – harold Jul 15 '15 at 12:23
  • What @harold said: gathered loads are horribly inefficient and will kill performance unless you're doing enough computation afterwards to amortise the initial cost of the loads. You might as well just stick to scalar code for this. – Paul R Jul 15 '15 at 12:46
  • Emulate gathers with `_mm_move_sd` + `_mm_loadh_pd`. On Haswell it is faster than hardware gather. (A sketch of this appears after these comments.) – Marat Dukhan Jul 15 '15 at 15:28
  • By far the most important improvement for sparse matrix multiplication is [the permutation matrix](https://stackoverflow.com/questions/29603982/cholesky-decomposition-of-sparse-matrices-using-permutation-matrices). This is going to speed up your code far more than SIMD can. Without a good permutation matrix you risk a lot of fill-in and far more computations than necessary (i.e. sparse*sparse may equal dense). It's an NP-hard problem, so there are many heuristic methods to determine the permutation matrix. It's a complicated problem. You don't have to reinvent just one wheel but many! – Z boson Jul 16 '15 at 07:25
  • [Using Eigen I got a huge speedup](http://stackoverflow.com/a/30526005/2542702) compared to my own dense solver using AVX (although my dense solver managed to still beat many other sparse solvers in speed) – Z boson Jul 16 '15 at 07:27
  • The matrices I'm usually solving with this are around 10-60 million rows with around 40 col entries per row and run on 10k+ cores. So direct solvers are out, unfortunately. For iterative solvers you mostly need dot products and mat-vec mul, that's why any speedup there would be beneficial... – chhu79 Jul 16 '15 at 08:26
  • Isn't there a sparse format that duplicates data, so you can access rows or columns contiguously? I added a paragraph about that in my answer. Or maybe I should make that a new answer for the other topic. – Peter Cordes Jul 16 '15 at 08:44
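
A minimal sketch of the emulated gather Marat Dukhan suggests (one interpretation, written with `_mm_load_sd`/`_mm_loadh_pd`, which does the same job as the `_mm_move_sd` variant):

#include <immintrin.h>

// Replace _mm256_i32gather_pd with four scalar loads; on Haswell this
// tends to beat the hardware gather instruction.
static inline __m256d gather4_pd(const double* base, const int* idx)
{
    __m128d lo = _mm_load_sd(base + idx[0]);       // [ x[i0], 0     ]
    lo         = _mm_loadh_pd(lo, base + idx[1]);  // [ x[i0], x[i1] ]
    __m128d hi = _mm_load_sd(base + idx[2]);
    hi         = _mm_loadh_pd(hi, base + idx[3]);
    return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
}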

1 Answer


Gather on Haswell is slow. I implemented an 8-bit-index LUT lookup of 16-bit values (for GF16 multiply, for par2) in a few different ways, to find out what's fastest. On Haswell, the VPGATHERDD version took 1.7x as long as the movd / pinsrw version. (Only a couple of VPUNPCK / shift instructions were needed beyond the gathers.) Code here, if anyone wants to run the benchmark.

As is common when an instruction is first introduced, Intel doesn't dedicate a huge amount of silicon to making it super fast. It's there just to get HW support out in the field, so code can be written to use it. For ideal performance on all CPUs, you then need to do what x264 did for pshufb: have a SLOW_SHUFFLE flag for CPUs like Core2, and factor that into your best-routine-finding function-pointer-setting, rather than keying only on which insns a CPU supports.
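
A minimal sketch of that kind of dispatch in C (illustrative only: `spmv_scalar` / `spmv_avx2` are stand-ins for the two versions from the question, and `__builtin_cpu_supports` is a GCC/Clang builtin):

typedef void (*spmv_fn)(const double* values, const int* col_indices,
                        const int* row_ptr, const double* x,
                        double* result, int n_rows);

// Stub bodies; in reality these would be the scalar and AVX2 loops.
static void spmv_scalar(const double* values, const int* col_indices,
                        const int* row_ptr, const double* x,
                        double* result, int n_rows) { /* scalar loop */ }
static void spmv_avx2(const double* values, const int* col_indices,
                      const int* row_ptr, const double* x,
                      double* result, int n_rows) { /* AVX2 loop */ }

static spmv_fn spmv;

void spmv_init(void)
{
    // A real best-routine finder (like x264's) would also benchmark the
    // candidates, so a CPU where an instruction exists but is slow still
    // gets the faster routine.
    spmv = __builtin_cpu_supports("avx2") ? spmv_avx2 : spmv_scalar;
}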

For projects less fanatical about having asm versions tuned for every CPU they can run on, introducing a no-speedup version of an instruction will get people using it sooner, so when the next design comes along and it's fast, more code speeds up. Releasing a design like Haswell where gather is actually a slowdown is a little dicey. Maybe they wanted to see how people would use it? It does increase code density, which helps when the gather isn't in a tight loop.

Broadwell is supposed to have a faster gather implementation, but I don't have access to one. The Intel manual that lists latency/throughput for instructions says Broadwell's gather is about 1.6x faster, so it would still be slightly slower than a hand-crafted loop that shifts/unpacks the indices in GP regs, and uses them for PINSRW into vectors.

If gather could take advantage of cases where multiple elements had the same index, or even an index that pointed to the same 32B fetch block, there could be some big speedups depending on the input data.

Hopefully Skylake will improve even further. I thought I'd read something saying that it would, but on checking, I didn't find anything.

RE: sparse matrices: isn't there a format that duplicates the data, so you can do contiguous reads for rows or columns? It's not something I've had to write code for, but I think I've seen mention of it in some answers.

Peter Cordes
  • You mean Skylake AVX2 gather will be faster still than Broadwell gather? Or do you mean AVX512 gather will be faster still than Broadwell's? Skylake won't have AVX512 except for Xeon processors. Do you have a source for the statements in your last paragraph? – Z boson Jul 16 '15 at 07:30
  • Hmm, I forget whether what I read about Skylake AVX2 gather improvements was speculation or sourced. It's a pretty obvious thing to improve in a "tock", assuming there's room for improvement without spending too many transistors. I'll see if I can find what I was remembering. – Peter Cordes Jul 16 '15 at 07:56
  • I saw [this](http://arstechnica.com/gadgets/2015/07/intel-confirms-tick-tock-shattering-kaby-lake-processor-as-moores-law-falters/) today. I guess tick-tock is over. There will be another 14nm processor next year called Kaby Lake. Maybe that will have AVX512 (instead of tick-tock it should be called tick tick tock). – Z boson Jul 16 '15 at 08:02
  • I changed my answer to not predict anything about Skylake. I'd read that too, about Kaby Lake. "tock" is a new arch on the same process, so it would be "tick tock tock", I guess. IDK why Intel chose the names that way around, since the pattern started with a new microarch (Core2). Maybe it wasn't a "pattern" until the 2nd entry: the Penryn "tick". – Peter Cordes Jul 16 '15 at 08:39