
I want to optimize histogram statistic code with NEON intrinsics, but I haven't succeeded. Here is the C code:

#define NUM (7*1024*1024)
uint8_t src_data[NUM];
uint32_t histogram_result[256] = {0};
for (int i = 0; i < NUM; i++)
{
    histogram_result[src_data[i]]++;
}

Histogram statistics are more like serial processing, so it's difficult to optimize with NEON intrinsics. Does anyone know how to optimize this? Thanks in advance.

maofu
  • It's almost impossible to optimise histogramming with any kind of SIMD, Neon included. The only notable exception being AVX-512. – Paul R Jul 21 '16 at 13:06
  • Like Paul said, your algorithm needs gather-loads and scatter-stores, which most SIMD instruction sets don't provide. – Peter Cordes Jul 21 '16 at 13:40
  • Related: tiny histograms (like 4 buckets) can use `count[0] += (arr[i] == 0)` which you can vectorize with SIMD packed compares - [Micro Optimization of a 4-bucket histogram of a large array or list](https://stackoverflow.com/q/61122144) – Peter Cordes Apr 10 '20 at 18:26
  • It occurs to me that if you're happy with a stochastic result then you can compress the results into probability masks, where a high percentage of the increments are randomly discarded with a logarithmic probability. This compresses the accumulators and lets the bulk of the runtime focus on rejection testing which you _can_ optimise for parallel operation. I just can't spare the brain cycles to develop this, though. – sh1 Oct 07 '22 at 21:15
  • Actually, @PeterCordes points to a method that would be part of my technique. A stochastic counter works something like `if (random(k) > f(counter[x])) counter[x]++;`, so as `counter[x]` grows the chances of the condition passing get progressively lower, and you apply an inverse function on `counter[]` at the end to compensate. This test can cover batches of counters by taking the minimum of the batch as the threshold, and only doing further work when the test triggers an increment. And you can parallelise these tests a few different ways. – sh1 Oct 07 '22 at 21:30
  • (don't forget when applying the inverse transform that the sum of all counters should be `NUM`, but the sum of approximate counters will not be -- so they should be renormalised to the proper sum to avoid problems there) – sh1 Oct 07 '22 at 21:35

1 Answer

You can't vectorise the stores directly, but you can pipeline them, and you can vectorise the address calculation on 32-bit platforms (and to a lesser extent on 64-bit platforms).

The first thing you'll want to do, which doesn't actually require NEON to benefit, is to unroll the histogram array so that you can have more data in flight at once:

#define NUM (7*1024*1024)
uint8_t src_data[NUM];
uint32_t histogram_result[256][4] = {{0}};
uint32_t packed_result[256];
for (int i = 0; i < NUM; i += 4)
{
    uint32_t *p0 = &histogram_result[src_data[i + 0]][0];
    uint32_t *p1 = &histogram_result[src_data[i + 1]][1];
    uint32_t *p2 = &histogram_result[src_data[i + 2]][2];
    uint32_t *p3 = &histogram_result[src_data[i + 3]][3];
    uint32_t c0 = *p0;
    uint32_t c1 = *p1;
    uint32_t c2 = *p2;
    uint32_t c3 = *p3;
    *p0 = c0 + 1;
    *p1 = c1 + 1;
    *p2 = c2 + 1;
    *p3 = c3 + 1;
}

for (int i = 0; i < 256; i++)
{
    packed_result[i] = histogram_result[i][0]
                     + histogram_result[i][1]
                     + histogram_result[i][2]
                     + histogram_result[i][3];
}

Note that p0 to p3 can never point to the same address (each uses a different column of the unrolled array), so reordering their reads and writes is safe.

From that you can vectorise the calculation of p0 to p3 with intrinsics, and you can vectorise the finalisation loop.

Test it as-is first (because I didn't!). Then you can experiment with structuring the array as result[4][256] instead of result[256][4], or using a smaller or larger unroll factor.
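To illustrate the result[4][256] variant: its finalisation needs only vertical adds of independent buckets, which a compiler can auto-vectorise (and which map directly onto vld1q_u32/vaddq_u32). A sketch, with the layout and function name assumed for illustration:

```c
#include <stdint.h>

/* Sum the four partial histograms of the transposed result[4][256]
 * layout into the final 256-entry histogram.  Consecutive buckets
 * are independent, so each iteration vectorises trivially. */
static void pack_histogram(uint32_t hist[4][256], uint32_t packed[256])
{
    for (int i = 0; i < 256; i++)
        packed[i] = hist[0][i] + hist[1][i] + hist[2][i] + hist[3][i];
}
```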

Applying some NEON intrinsics to this:

#include <arm_neon.h>

uint32_t histogram_result[256 * 4] = {0};
static const uint16_t offsets[] = { 0x000, 0x001, 0x002, 0x003,
                                    0x000, 0x001, 0x002, 0x003 };
uint16x8_t voffs = vld1q_u16(offsets);
for (int i = 0; i < NUM; i += 8) {
    uint8x8_t p = vld1_u8(&src_data[i]);
    uint16x8_t p16 = vshll_n_u8(p, 2);   /* index = value * 4 */
    p16 = vaddq_u16(p16, voffs);         /* plus per-lane sub-counter */
    uint32_t c0 = histogram_result[vgetq_lane_u16(p16, 0)];
    uint32_t c1 = histogram_result[vgetq_lane_u16(p16, 1)];
    uint32_t c2 = histogram_result[vgetq_lane_u16(p16, 2)];
    uint32_t c3 = histogram_result[vgetq_lane_u16(p16, 3)];
    histogram_result[vgetq_lane_u16(p16, 0)] = c0 + 1;
    c0 = histogram_result[vgetq_lane_u16(p16, 4)];
    histogram_result[vgetq_lane_u16(p16, 1)] = c1 + 1;
    c1 = histogram_result[vgetq_lane_u16(p16, 5)];
    histogram_result[vgetq_lane_u16(p16, 2)] = c2 + 1;
    c2 = histogram_result[vgetq_lane_u16(p16, 6)];
    histogram_result[vgetq_lane_u16(p16, 3)] = c3 + 1;
    c3 = histogram_result[vgetq_lane_u16(p16, 7)];
    histogram_result[vgetq_lane_u16(p16, 4)] = c0 + 1;
    histogram_result[vgetq_lane_u16(p16, 5)] = c1 + 1;
    histogram_result[vgetq_lane_u16(p16, 6)] = c2 + 1;
    histogram_result[vgetq_lane_u16(p16, 7)] = c3 + 1;
}
}

With the histogram array unrolled x8 rather than x4 you might want to use eight scalar accumulators instead of four, but remember that implies eight count registers plus eight address registers, which is more registers than 32-bit ARM has available (you can't use SP or PC).

Unfortunately, with the address calculation in the hands of NEON intrinsics, I think the compiler can't safely reason about how it might reorder reads and writes, so you have to reorder them explicitly and hope that you're doing it the best possible way.

sh1
  • I like this idea of trying to reduce store-forwarding delays for multiple increments of the same entry. Touching more memory is problematic, though, as is zeroing it all first. – Peter Cordes Jul 22 '16 at 04:40
  • `result[4][256]` is much faster to process at the end, since you only need vertical SIMD operations. i.e. you can sum the 4 (or 3, or 7) partitions for 4 SIMD elements in parallel. If you're processing a LOT of input data, then it might be worth using `result[256][4]` and having to do a horizontal sum of the N partitions for each bucket, if your data has a distribution that leads to better cache behaviour during the increment phase. e.g. a run of the same input value creates 4 accesses to the same cache line, rather than to 4 cache lines with a stride of `256*sizeof(int)`. – Peter Cordes Jul 22 '16 at 04:47
  • `you can vectorise the calculation of p0 to p3 with intrinsics`. Not usefully, without gather + scatter. With highly efficient AVX2 gathers (e.g. on Skylake), it might be a win to do a gather load and then scatter manually, but calculating the addresses in a SIMD vector and then extracting each vector element to an integer register for use as a pointer is expensive. Even worse when you also have to extract the incremented counters to memory. (For x86, though, there's an instruction (`pextrd`) which can extract a vector element directly to memory (with an immediate element index)). – Peter Cordes Jul 22 '16 at 04:56
  • On ARM32 NEON, where vectors are composed of multiple narrower registers, you get per-element addressing "for free", so doing 4 scalar stores should be cheap. (But it doesn't have a gather load, so you couldn't get 4 counters into a vector in the first place.) – Peter Cordes Jul 22 '16 at 04:59
  • @PeterCordes, NEON allows you to fill two scalar registers from a vector in a single operation, so it's just two of those operations for four lanes; but the latency may be a problem on in-order cores. When I implemented it (in my day job) I read from `src_data` and did all the address arithmetic in NEON, then passed the addresses to scalar, then did load, increment, and store all in scalar. I figure the scalar path is probably better optimised for random-access concurrency and it wasn't worth going through SIMD just for an increment operation. – sh1 Jul 22 '16 at 05:51
  • @sh1 @PeterCordes, thank you very much. The original C code consumed 75 ms, and sh1's C code consumed 59 ms (built with -O2). But I still don't know how to optimize sh1's code with NEON intrinsics. Hi sh1, can you put up some code to explain "I read from src_data and did all the address arithmetic in NEON, then passed the addresses to scalar, then did load, increment, and store all in scalar. I figure the scalar path is probably better optimised for random-access concurrency"? Thanks again. – maofu Jul 22 '16 at 09:19
  • Fun idea! My intuition is in agreement with @PeterCordes that when caching is a big part of performance that the benefits are probably sensitive to the application (aka data sizes) and platform. IE - I think its always important to mention caching pros/cons in relation to optimization of any sort. However, this is the first time in a while I've seen something and thought "oh, cool, I hadn't thought of that!" Nice! I'm definitely looking forward to playing with this idea some time. – Peter M Jul 25 '16 at 17:09
  • I like to think (but am too lazy to test) that since the hot area is contiguous, the expansion from 1kB to 4kB is negligible. When that same extra 3kB is randomly distributed through memory it's more prone to aliasing which can easily drive things off cliffs. – sh1 Jul 26 '16 at 06:16
  • @sh1: Yes, that sounds sensible. The "hash function" that caches use to distribute physical addresses into sets is (almost?) always a trivial modulo by the relevant power of 2. (e.g. [bits 6 to 11 of the address](http://stackoverflow.com/a/38549736/224132)). The extra 3k to zero and cache is fine as long as it's negligible compared to the size of the data you're histogramming. (e.g. if you need a fast histo function for many small inputs, instead of one large input, the extra zeroing and sum at the end can be a problem. Or if it's a histogram with 2^16 buckets instead of 2^8) – Peter Cordes Jul 26 '16 at 09:20