
I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.

A naive implementation is below:

const int N_OBS = 100'000'000;
const int N_BINS = 16;
double obs[N_OBS];        // Observations
int8_t offsets[N_OBS];    // Bin assignment for each observation, in [0, N_BINS)
double acc[N_BINS] = {0}; // One accumulator per bin

for (int i = 0; i < N_OBS; ++i) {
  acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}

Is this possible using SIMD/AVX intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but can't seem to figure out a good way to get it done.

ptau
  • There's no pattern to your `offsets[]` array that would let you use fixed shuffles? Or any typical pattern you can use but check for exceptions? This is quite similar to the histogram problem, just adding the value instead of incrementing a counter. Unrolling and distributing over multiple `acc[]` arrays will allow you to hide FP and store/reload latency if there are runs of the same bin in `offsets[]` (see the first sketch after this comment thread). (Then SIMD is helpful at the end to add the corresponding doubles 4 at a time instead of 1, although that's trivial vs. the main loop and will auto-vectorize.) – Peter Cordes Apr 15 '22 at 04:06
  • See [Methods to vectorise histogram in SIMD?](https://stackoverflow.com/q/12985949) – Peter Cordes Apr 15 '22 at 04:10
  • It might help to accumulate into `16*16` `__m128d` accumulators which are indexed via `offsets[2*i]*16 + offsets[2*i+1]` -- this would approximately halve the number of required loads/stores/fpadd operations in the main loop (see the second sketch after this comment thread). – chtz Apr 15 '22 at 14:13
  • @PeterCordes Thank you for the reply and the link. Unfortunately, there are no patterns to the offsets, they are completely random (within the range). It's a frustrating problem, but there's a lot of literature on vectorizing histograms, so thank you for that lead! – ptau Apr 15 '22 at 14:54
  • @chtz The number of bins isn't always fixed, but that's an interesting approach. I'm not against limiting the max bins in order to get more efficient computation. – ptau Apr 15 '22 at 14:59
  • @ptau The approach is also possible if the number is not fixed. Even doing an integer multiplication to compute the index should not hurt too much. If you have too many bins, your accumulators won't fit into L1 cache anymore, however. – chtz Apr 15 '22 at 15:03
  • @chtz: If load / store ports are a bottleneck, another thing that could help is doing __m256d loads and unpack to 128-bit halves with `vextractf128`. (For memory-source `vaddpd xmm`. But on Haswell/Skylake, it would actually be better for the front-end to use non-VEX `addpd xmm, mem` from an aligned destination, because [that could keep an indexed addressing mode micro-fused](https://stackoverflow.com/q/26046634/224132). And on Haswell / ICL, you can't mix that non-VEX with a 256-bit load without SSE/AVX stalls. You can on Skylake, but you won't get a compiler to emit that.) – Peter Cordes Apr 16 '22 at 00:43
  • I don’t think scatter/gather is going to work. Potentially, depending on the offset values, there are data dependencies between SIMD lanes. For instance, if 4 consecutive offset values are [ 0, 0, 0, 0 ], the `_mm256_i32gather_pd` instruction will broadcast the first value from the histogram; you then increment that vector by 4 source values, and the results will be incorrect because you actually need a horizontal sum across the vector. – Soonts Apr 17 '22 at 13:56
  • The best thing you can do instead is to rework the surrounding code so that whichever code produces your input data updates the histogram on the fly, as the new values are generated or loaded. – Soonts Apr 17 '22 at 13:56
  • There is one area for optimization: the calculation will be performed thousands of times with the same `offsets`, but with different `obs` each time. I tried loop unrolling and multiple accumulators, @PeterCordes, @chtz, but it didn't seem to help much. I don't know enough about SIMD to understand your comments about load bottlenecks, but I'm researching it. – ptau Apr 18 '22 at 14:06
  • @Soonts: yes, possible dependencies between lanes for scatter/gather are why [`vpconflictd`/`q`](https://www.felixcloutier.com/x86/vpconflictd:vpconflictq) exist. If your data normally *doesn't* have any conflicts, it's just some ALU overhead to check for it, with an occasional rerun with masking to do the thing for the conflicting vector, potentially repeating until you've finished all the conflicting elements. (Or fallback to scalar). But with few bins, conflicts will be frequent, making that strategy horrible. – Peter Cordes Apr 18 '22 at 19:15
  • If `offsets` does not change, you could instead collect all indexes where `offset[i]==0`, `offset[i]==1`, etc. (essentially like a sparse matrix in compressed-row-storage but without storing implicit 1-entries). You can then use AVX2's `vgatherdpd` to load 4 values at once. – chtz Apr 22 '22 at 20:37
  • Related question: https://stackoverflow.com/questions/51461429/why-is-sparse-dense-multiplication-faster-than-dense-sparse-multiplication – chtz Apr 22 '22 at 20:46
  • Can you run it on a GPU? PCIe 4.0 x16 offers 31.5 GB/s to transfer the data there; with multiple GPUs even faster, if your mainboard has enough lanes (the PCIe interface would be the bottleneck). – Sebastian Apr 23 '22 at 16:05
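
A minimal sketch of the multiple-accumulator idea from Peter Cordes's comment above (untested; the function name and the fixed 16-bin count are illustrative assumptions, not from the thread). Four independent copies of `acc[]` mean a run of identical offsets only revisits the same slot every 4th iteration, hiding FP-add and store/reload latency:

#include <cstddef>
#include <cstdint>

void accumulate_unrolled(const double* obs, const int8_t* offsets, size_t n, double out[16]) {
  // Four independent accumulator arrays; a repeated bin index only touches
  // the same memory location every 4th iteration.
  double acc[4][16] = {};
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    acc[0][offsets[i]]     += obs[i];
    acc[1][offsets[i + 1]] += obs[i + 1];
    acc[2][offsets[i + 2]] += obs[i + 2];
    acc[3][offsets[i + 3]] += obs[i + 3];
  }
  for (; i < n; ++i) // scalar tail
    acc[0][offsets[i]] += obs[i];

  // Trivial final reduction; this part will auto-vectorize.
  for (int b = 0; b < 16; ++b)
    out[b] = (acc[0][b] + acc[1][b]) + (acc[2][b] + acc[3][b]);
}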
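
And a sketch of chtz's pair-indexed accumulator suggestion (also untested and illustrative, assuming SSE2-level `__m128d` support and exactly 16 bins). Lane 0 of `acc[a*16 + b]` collects values for bin `a` and lane 1 for bin `b`, so the main loop issues roughly half as many loads, stores, and adds:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void accumulate_pairs(const double* obs, const int8_t* offsets, size_t n, double out[16]) {
  // 16*16 vector accumulators: acc[a*16 + b] sums bin a in lane 0, bin b in lane 1.
  // 16*16 * 16 bytes = 4 KB, which fits comfortably in L1 cache.
  __m128d acc[16 * 16];
  for (int k = 0; k < 16 * 16; ++k)
    acc[k] = _mm_setzero_pd();

  size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    const int k = offsets[i] * 16 + offsets[i + 1];
    acc[k] = _mm_add_pd(acc[k], _mm_loadu_pd(obs + i)); // {obs[i], obs[i+1]}
  }

  for (int b = 0; b < 16; ++b)
    out[b] = 0;
  for (int a = 0; a < 16; ++a) {
    for (int b = 0; b < 16; ++b) {
      alignas(16) double tmp[2];
      _mm_store_pd(tmp, acc[a * 16 + b]);
      out[a] += tmp[0]; // lane 0 belongs to bin a
      out[b] += tmp[1]; // lane 1 belongs to bin b
    }
  }
  if (i < n) // odd tail element
    out[offsets[i]] += obs[i];
}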

2 Answers


Modern CPUs are surprisingly good at running your naïve version. On AMD Zen3, I’m getting 48 ms for 100M random numbers on input; that’s 18 GB/sec of RAM read bandwidth, about 35% of the hard bandwidth limit on my computer (dual-channel DDR4-3200).

I’m afraid no SIMD is going to help. Still, the best version I got is the following. Compile with OpenMP support; the switch depends on your C++ compiler.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <immintrin.h>

void computeHistogramScalarOmp( const double* rsi, const int8_t* indices, size_t length, double* rdi )
{
    // Count of OpenMP threads = CPU cores to use
    constexpr int ompThreadsCount = 4;

    // Use an independent set of accumulators per thread, otherwise concurrent updates would corrupt the data.
    // Align by 64 bytes = 1 cache line; we want to assign whole cache lines to CPU cores, because sharing them across cores is extremely expensive.
    alignas( 64 ) double accumulators[ 16 * ompThreadsCount ];
    memset( &accumulators, 0, sizeof( accumulators ) );

    // Minimize OMP overhead by dispatching very few large tasks
#pragma omp parallel for schedule(static, 1)
    for( int i = 0; i < ompThreadsCount; i++ )
    {
        // Grab a slice of the output buffer
        double* const acc = &accumulators[ i * 16 ];

        // Compute a slice of the source data for this thread
        const size_t first = i * length / ompThreadsCount;
        const size_t last = ( i + 1 ) * length / ompThreadsCount;

        // Accumulate into thread-local portion of the buffer
        for( size_t j = first; j < last; j++ )
        {
            const int8_t idx = indices[ j ];
            acc[ idx ] += rsi[ j ];
        }
        }
    }

    // Reduce 16*N scalars to 16 with a few AVX instructions
    for( int i = 0; i < 16; i += 4 )
    {
        __m256d v = _mm256_load_pd( &accumulators[ i ] );
        for( int j = 1; j < ompThreadsCount; j++ )
        {
            __m256d v2 = _mm256_load_pd( &accumulators[ i + j * 16 ] );
            v = _mm256_add_pd( v, v2 );
        }
        _mm256_storeu_pd( rdi + i, v );
    }
}

The above version runs in 20.5 ms, which translates to 88% of the RAM bandwidth limit.

P.S. I have no idea why the optimal thread count is 4 here; I have 8 cores/16 threads in the CPU. Both lower and higher values decrease the bandwidth. The constant is probably CPU-specific.

Soonts
  • You didn't find it worthwhile to have each thread unroll over 2 or 4 copies of a count array, to avoid latency bottlenecks if the same index comes up frequently within a smallish block (about ROB or per-port scheduler size)? With indices that are uniformly random, probably not a problem, but could be with real data if it's not uniform. – Peter Cordes Apr 19 '22 at 00:27
  • Thank you for this answer. OpenMP does offer some really good speedup, and is the approach I'll end up going with. One thing I noticed is that using `schedule(dynamic)` offered the best performance (about 85% of the runtime of `(static, 1)`). Could it be that there are some cache locality considerations, especially on servers with multiple CPU packages? – ptau Apr 20 '22 at 13:59
  • @ptau I don’t really know why dynamic is better. I can confirm it here in VS2022/Win10, albeit not by much, a few percent. It might be different optimization choices made by the compiler, or maybe different behavior of the OMP runtime affecting the performance. – Soonts Apr 21 '22 at 12:55

If indeed the offsets do not change for thousands (probably even tens) of iterations, it is likely worthwhile to "transpose" them, i.e., to store all indices which need to be added to `acc[0]`, then all indices which need to be added to `acc[1]`, etc.

Essentially, what you are doing originally is a sparse-matrix times dense-vector product, with the matrix in compressed-column-storage format (without explicitly storing the 1-values).

As shown in this answer, sparse GEMV products are usually faster if the matrix is stored in compressed-row storage (even without AVX2's gather instruction, you don't need to load and store the accumulated value every time).

Untested example implementation:

#include <cstddef>
#include <cstdint>
#include <vector>

using sparse_matrix = std::vector<std::vector<int> >;

// call this once:
sparse_matrix transpose(uint8_t const* offsets, int n_bins, int n_obs){
    sparse_matrix res;
    res.resize(n_bins);

    // collect the indices of the observations belonging to each bin:
    for(int i=0; i<n_obs; ++i) {
        // assert(offsets[i] < n_bins);
        res[offsets[i]].push_back(i);
    }

    return res;
}

void accumulate(double acc[], sparse_matrix const& indexes, double const* obs){
    for(std::size_t row=0; row<indexes.size(); ++row) {
        double sum = 0;
        for(int col : indexes[row]) {
            // you can manually vectorize this using _mm256_i32gather_pd,
            // but clang/gcc should autovectorize this with -ffast-math -O3 -march=native
            sum += obs[col];
        }
        acc[row] = sum;
    }
}
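
If you do want to vectorize that inner loop by hand, a minimal sketch with AVX2's `_mm256_i32gather_pd` could look like this (untested; `sum_row` is an illustrative name, not part of the answer, and it reorders the FP additions the way `-ffast-math` would):

#include <immintrin.h>

// Sums obs[col] over one row of indexes, gathering 4 doubles per iteration.
double sum_row(std::vector<int> const& cols, double const* obs){
    __m256d vsum = _mm256_setzero_pd();
    std::size_t i = 0;
    for(; i + 4 <= cols.size(); i += 4) {
        // load 4 column indices, then gather 4 doubles (scale 8 = sizeof(double))
        __m128i idx = _mm_loadu_si128(reinterpret_cast<__m128i const*>(&cols[i]));
        vsum = _mm256_add_pd(vsum, _mm256_i32gather_pd(obs, idx, 8));
    }
    // horizontal sum of the 4 lanes
    __m128d lo = _mm_add_pd(_mm256_castpd256_pd128(vsum), _mm256_extractf128_pd(vsum, 1));
    double sum = _mm_cvtsd_f64(_mm_add_sd(lo, _mm_unpackhi_pd(lo, lo)));
    for(; i < cols.size(); ++i)   // scalar tail
        sum += obs[cols[i]];
    return sum;
}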
chtz
  • To allow ILP without too much unrolling of the inner loop, you could maybe set things up so you alternate work from two rows at once. But if they're different lengths, that means extra complexity vs. just using more accumulators to reduce at the end, so maybe not. – Peter Cordes Apr 23 '22 at 00:25
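
For reference, the simpler "more accumulators" variant mentioned in that comment might look like this sketch (illustrative name, same includes as the answer's snippet): four scalar partial sums let independent FP adds overlap without interleaving rows of different lengths.

double sum_row_ilp(std::vector<int> const& cols, double const* obs){
    // Four independent partial sums expose instruction-level parallelism:
    // each addition chain only sees every 4th element.
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for(; i + 4 <= cols.size(); i += 4) {
        s0 += obs[cols[i]];
        s1 += obs[cols[i + 1]];
        s2 += obs[cols[i + 2]];
        s3 += obs[cols[i + 3]];
    }
    for(; i < cols.size(); ++i)   // scalar tail
        s0 += obs[cols[i]];
    return (s0 + s1) + (s2 + s3);
}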