I'm trying to leverage the new AVX2 GATHER instructions to speed up a sparse matrix - vector multiplication. The matrix is in CSR (or Yale) format with a row pointer that points to a column index array which in turn holds the columns. The C code for such a mat-vec mul does look like this:
for (int row = 0; row < n_rows - 1; row++) {
double rowsum = 0;
for (int col = row_ptr[row]; col < row_ptr[row + 1]; col++) {
rowsum += values[col] * x[col_indices[col]];
}
result[row] = rowsum;
}
Now my goal is to accelerate this with AVX2 intrinsics. The following code works with latest Intel or GCC, based on https://blog.fox-toolkit.org/?p=174 . I removed the remainder here because my rows all align on 4 doubles (columns % 4==0) anyway (lucky me). I have the code dealing with remainder too if someone is interested, but the point is, the code is actually slightly slower. I checked the disassembly and for the above version only FP instructions are generated and for my AVX2 code all AVX2 ops appear as expected. Even with small matrices that fit into the cache the AVX2 version is no good. I'm puzzled here...
double* value_base = &values[0];
double* x_base = &x[0];
int* index_base = &col_indices[0];
for (int row = 0; row < n_rows - 1; row++) {
int col_length = row_ptr[row + 1] - row_ptr[row];
__m256d rowsum = _mm256_set1_pd(0.);
for (int col4 = 0; col4 < col_length; col4 += 4) {
// Load indices for x vector(const __m128i*)
__m128i idxreg = _mm_load_si128((const __m128i*)index_base);
// Load 4 doubles from x indexed by idxreg (AVX2)
__m256d x_ = _mm256_i32gather_pd(x_base, idxreg, 8);
// Load 4 doubles linear from memory (value array)
__m256d v_ = _mm256_load_pd(value_base);
// FMA: rowsum += x_ * v_
rowsum = _mm256_fmadd_pd(x_, v_, rowsum);
index_base += 4;
value_base += 4;
}
__m256d s = _mm256_hadd_pd(rowsum, rowsum);
result[row] = ((double*)&s)[0] + ((double*)&s)[2];
// Alternative (not faster):
// Now we split the upper and lower AVX register, and do a number of horizontal adds
//__m256d hsum = _mm256_add_pd(rowsum, _mm256_permute2f128_pd(rowsum, rowsum, 0x1));
//_mm_store_sd(&result[row], _mm_hadd_pd( _mm256_castpd256_pd128(hsum), _mm256_castpd256_pd128(hsum) ) );
}
Any suggestions are welcome.
Thanks a lot, Chris