
I am trying to implement a histogram in Neon. Is it possible to vectorise it?

Rugger
  • If you're targeting iOS, you may be able to use the vImage histogram implementations supplied in the Accelerate.framework; not sure what's available on other platforms. – Stephen Canon Oct 21 '12 at 01:00
  • @StephenCanon I am working on Android at present, so yes, I guess I just have to use the C version of the code. No use in going to assembly programming, right? – Rugger Oct 30 '12 at 13:08
  • Related: tiny histograms (like 4 buckets) can use `count[0] += (arr[i] == 0)` which you can vectorize with SIMD packed compares - [Micro Optimization of a 4-bucket histogram of a large array or list](https://stackoverflow.com/q/61122144) – Peter Cordes Apr 10 '20 at 18:24
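
The 4-bucket trick from the comment above can be sketched in plain C (the function name `hist4` is just illustrative, assuming input values are all in 0..3). Each bucket is updated with the 0-or-1 result of a compare instead of an indexed store, which a compiler can map onto SIMD packed compares:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchless 4-bucket histogram: every bucket is updated with a
 * compare result (0 or 1) rather than an indexed memory increment,
 * so there is no store-forwarding dependency chain per bucket. */
void hist4(const uint8_t *arr, size_t n, uint32_t count[4])
{
    uint32_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (size_t i = 0; i < n; ++i) {
        c0 += (arr[i] == 0);
        c1 += (arr[i] == 1);
        c2 += (arr[i] == 2);
        c3 += (arr[i] == 3);
    }
    count[0] = c0; count[1] = c1; count[2] = c2; count[3] = c3;
}
```

This only pays off for a handful of buckets; for a 256-entry histogram the number of compares per element becomes prohibitive.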

4 Answers


Histogramming is almost impossible to vectorize, unfortunately.

You can probably optimise the scalar code somewhat, however: a common trick is to use two histograms and then combine them at the end. This lets you overlap loads, increments, and stores, and thereby bury some of the serial dependencies and their associated latencies. Pseudocode:

init histogram 1 to all 0s
init histogram 2 to all 0s
loop
  get input value 1
  get input value 2
  load count for value 1 from histogram 1
  load count for value 2 from histogram 2
  increment count for histogram 1
  increment count for histogram 2
  store count for value 1 to histogram 1
  store count for value 2 to histogram 2
until done
combine histogram 1 and histogram 2
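
A minimal C rendering of the pseudocode above (names like `histogram2` and `HIST_SIZE` are illustrative, assuming 8-bit input values):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HIST_SIZE 256  /* one bin per possible uint8_t value */

/* Two partial histograms let the load/increment/store chains for even
 * and odd elements proceed independently, hiding some of the
 * store-to-load latency of a single-histogram loop. */
void histogram2(const uint8_t *src, size_t n, uint32_t hist[HIST_SIZE])
{
    uint32_t h1[HIST_SIZE], h2[HIST_SIZE];
    memset(h1, 0, sizeof h1);
    memset(h2, 0, sizeof h2);

    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        ++h1[src[i]];       /* input value 1 -> histogram 1 */
        ++h2[src[i + 1]];   /* input value 2 -> histogram 2 */
    }
    if (i < n)              /* odd-length tail element */
        ++h1[src[i]];

    /* combine histogram 1 and histogram 2 */
    for (size_t k = 0; k < HIST_SIZE; ++k)
        hist[k] = h1[k] + h2[k];
}
```

The final combine loop is trivially vectorizable; the main loop remains scalar, which is the point of the answer above.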
Paul R
  • agreed, I saw a few papers on the subject but most of them were actually suggesting changes to the instruction set :) – auselen Oct 20 '12 at 07:32
  • It might be possible to some extent with SIMD ISAs that support scatter/gather, but you would need N partial histograms, and the fact that histogramming is mostly just loads and stores means that performance would probably be limited by the memory subsystem. – Paul R Oct 20 '12 at 13:55
  • Almost impossible to vectorize *completely* perhaps, but it can definitely benefit from use of vector instructions, at least on some architectures. – Stephen Canon Oct 21 '12 at 00:59
  • @Stephen Canon: I'd be interested in any references for this as I've looked into the subject quite extensively and not found any useful SIMD optimisation techniques for histogramming. – Paul R Oct 21 '12 at 16:06
  • @PaulR - On the GPU, you can use point scattering as one approach: http://stackoverflow.com/questions/10316708/ios-glsl-is-there-a-way-to-create-an-image-histogram-using-a-glsl-shader/10359305#10359305 , and that has some parallel character to it. The paper I link to includes a few other image histogram generation techniques, with varying degrees of parallel operations being used for those. I could see possibly adapting one of those to a CPU-based routine that might benefit from SIMD operations in places. Apple does this in their Core Image histogram filter, from what I hear. – Brad Larson Oct 24 '12 at 17:41
  • @Brad: thanks - that's interesting - I'm not sure it will work for the kinds of histogram that I need to generate, but I'll study it carefully in case there's anything I can exploit. – Paul R Oct 24 '12 at 17:51
  • @Antonio: interesting, yes, but not of any practical use - see my comment on [Amir's answer](http://stackoverflow.com/a/23019540/253056). – Paul R Feb 16 '16 at 15:17

ermig1979's Simd project shows how to implement histograms using a similar approach to the one @Paul-R mentioned, but also with SSE2 and AVX2 variants:

Project: https://github.com/ermig1979/Simd

Base file: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdBaseHistogram.cpp

An AVX2 implementation can be seen here: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdAvx2Histogram.cpp

A scalar solution can be seen below to illustrate the basic principle of creating multiple histograms that are summed at the end:

// Excerpted from the Simd project: HISTOGRAM_SIZE (256) and
// Simd::AlignLo are defined elsewhere in the library.
void Histogram(const uint8_t * src, size_t width, size_t height, size_t stride, 
    uint32_t * histogram)
{
    // Four partial histograms break the serial dependency between increments.
    uint32_t histograms[4][HISTOGRAM_SIZE];
    memset(histograms, 0, sizeof(uint32_t)*HISTOGRAM_SIZE*4);
    size_t alignedWidth = Simd::AlignLo(width, 4);
    for(size_t row = 0; row < height; ++row)
    {
        size_t col = 0;
        for(; col < alignedWidth; col += 4)
        {
            ++histograms[0][src[col + 0]];
            ++histograms[1][src[col + 1]];
            ++histograms[2][src[col + 2]];
            ++histograms[3][src[col + 3]];
        }
        for(; col < width; ++col)
            ++histograms[0][src[col + 0]];

        src += stride;
    }

    for(size_t i = 0; i < HISTOGRAM_SIZE; ++i)
        histogram[i] = histograms[0][i] + histograms[1][i] + 
            histograms[2][i] + histograms[3][i];
}
nietras
  • This is a scalar solution, not a vectorized one. – BeeOnRope May 24 '18 at 23:55
  • "but also with SSE2 and AVX2" see links, but yes the code listed is not vectorized explicitly. – nietras May 25 '18 at 10:08
  • I mean the OP is asking for a vectorized histogram, and you say "this guy has some" and then you excerpt a plain(ish) scalar implementation? If anything you should link at least a vectorized one. It is, I suppose, possible that a compiler would vectorize this one, with gather then scatter, plus conflict detection for the inner loop (does NEON have all of that?), but I'll believe it when I see it. – BeeOnRope May 25 '18 at 14:47
  • @BeeOnRope: Using a separate histogram for each vector element (like is done here) avoids the need for conflict-detection. [How to optimize histogram statistics with neon intrinsics?](https://stackoverflow.com/q/38501529) suggests using NEON to vectorize the index calculations (but only if NEON->integer is cheap; that's not the case on many ARM CPUs). The cleanup sum into one histogram at the end is easy to vectorize, of course. – Peter Cordes May 28 '18 at 12:56
  • You can sum into `histograms[0]` instead of a separate array (and then `realloc` if you want), or use `histogram[]` instead of `histograms[3]` to save cache footprint. – Peter Cordes May 28 '18 at 12:57
  • @PeterCordes - that's true, but I doubt this was the intent here (this is the base non-vectorized implementation). The 4(ish) buckets are a common trick in scalar implementations in order to limit stalls caused by long dependent chains of store-forwarding when the histogram has some values with high frequency. Vectorizing the index calculations seems quite pointless - ARM has double register indirect addressing (including scaled), right? So the stores can just use that mode. Even if internally those modes use a couple of ops to calculate the address, they would have to be really slow... – BeeOnRope May 28 '18 at 17:46
  • ... to be worse than a vectorized calculation and then a one-by-one extraction of the address into a GP register so you could use it. Although, I guess you get to use it twice (load, store) so maybe that makes it potentially a bit better. I don't know how fast scaled indexing is on common ARM chips. – BeeOnRope May 28 '18 at 17:47
  • Anyways, I did finally look up the "AVX2" histogram [linked above](https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdAvx2Histogram.cpp). As far as I can tell, the actual histogram calculation is not vectorized. What happens is that there are various "flavors" of histogram, such as masked histogram (only values selected by the mask are counted) or conditional histogram (only values that meet some simple criteria passed to the function are included). The handling of those extra features is done in a vectorized manner (see `MaskSrc` or `ConditionalSrc`) ... – BeeOnRope May 28 '18 at 17:59
  • ... basically by doing the masking or conditional check with AVX2 and then writing the result back to a buffer (expanded to 16 bits to accommodate the 257th state of "don't count"). At the end of all that, a normal scalar histogram is done on these values. So there doesn't appear to be any real vectorization of the histogram part. Honestly I suspect a scalar `cmov`-based approach would be just as fast. – BeeOnRope May 28 '18 at 18:24
  • @BeeOnRope: yeah, using NEON for address calc sounded bogus to me, too. I'm still not sure whether it was a win or not, but [this comment](https://stackoverflow.com/questions/38501529/how-to-optimize-histogram-statistics-with-neon-intrinsics#comment64433884_38511540) says "NEON allows you to fill two scalar registers from a vector in a single operation". Maybe it's a win, or maybe it was just not slower, or maybe it worked around a scalar missed-optimization in that user's case. An ARM load-pair instruction into scalar registers should set up nicely for indexed addressing, as you say. – Peter Cordes May 28 '18 at 21:50
  • @PeterCordes - that's a cool feature, I've often wished for it on x86. On architectures (most?) that only have one destination register per micro-op, this is going to end up as 2 uops, so my wish is probably futile (if it existed on x86 it would probably perform the same as N `movd` ops or whatever). I've also wished for the opposite (GP regs to vector, and this doesn't have the one-dest limitation). – BeeOnRope May 28 '18 at 22:18
  • @BeeOnRope: agreed; ARM's ability to efficiently write 2 registers from one instruction is one thing I envy. Some NEON 2-register shuffles also do that. – Peter Cordes May 28 '18 at 22:22
  • @PeterCordes - I guess I'm implying that on high-performance ARM it may not be particularly efficient: can we be sure it is not internally split in two ops? Most renamers are designed to handle a relatively fixed format like 1 dest and N source regs, especially wide ones. So I wouldn't be surprised that internally this kind of instruction takes up 2 rename "slots" or whatever you want to call them. Just speculation though. Not that it would be bad, it would still save instruction bytes and be efficient in other parts of the pipeline such as only counting for 1 load in the load EUs. – BeeOnRope May 28 '18 at 22:48
  • @BeeOnRope: IDK. Glibc's AArch64 `memcpy` uses `ldp` / `stp` (load-pair) to copy 16 bytes at a time, rather than using SIMD vectors. Last I looked, I don't think there was a separate version that used SIMD `q` registers. That doesn't prove anything about renamer throughput, though. – Peter Cordes May 28 '18 at 22:50
  • Probably `memcpy` isn't a good example for a case where total width/rename throughput matters since it's highly likely to be limited by load/store ports rather than total/rename width, and perhaps there is some other slight advantage to using the GP regs? Since "number of registers (destinations) renamed" is basically the critical parameter for a renamer, it seems highly unlikely to me that wide designs such as Apple's 6-wide can actually rename say 12 regs to accommodate the load-double case. Still, max width is unlikely to be a bottleneck in most code, so ... – BeeOnRope May 28 '18 at 22:59
  • ... even if what I'm saying about rename is correct (and it very well might not be), you'd probably still get most of the advantages anyway, since load/store EU limitations are more likely IMO. Actually `memcpy` isn't a great example though, with NEON, since yeah - why not just use NEON? – BeeOnRope May 28 '18 at 23:00

@Paul-R, there is a paper at this link which discusses how to vectorize histogram functions:

SIMD Vectorization of Histogram Functions

Paul R
  • Yes, it's an interesting paper, but AIUI in the paper they are just inventing some hypothetical new SIMD instructions which *would* make histogramming easier - it doesn't solve the problem of using SIMD for histogramming on real-world CPUs (ARM with Neon SIMD in the OP's case)? – Paul R Feb 16 '16 at 15:17
  • @PaulR The paper proposes two methods. In the method proposed in section 4.1, which is the missing instruction? Indexed load and indexed store? – Antonio Feb 16 '16 at 20:31
  • @Antonio: yes, it seems they added scatter/gather store/load instructions in the simulator. AVX2 and AVX-512 have similar instructions but they tend to be very inefficient. – Paul R Feb 16 '16 at 23:22
  • @PaulR can you please name the gather/scatter instructions in AVX2? – Amir ElAttar Jan 05 '17 at 16:39
  • @AmirElAttar: AVX2 has gather, but not scatter, e.g. `_mm256_mask_i32gather_epi32` - AVX-512 has both gather and scatter, e.g. `_mm512_i32extscatter_epi32` - see the [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2&text=gather&expand=2775) for further details. – Paul R Jan 05 '17 at 16:59
  • @PaulR Thanks so much for your answer – Amir ElAttar Jan 06 '17 at 21:54

Some image processing algorithms that work on histograms (e.g. equalization, histogram matching) can be made to work with known percentiles -- and for an approximation one can effectively parallelize the search by starting from initial ranges (0, 25, 50, 75, 100%) using 4 accumulators.

Each item in the input stream must be compared in parallel against all slots, incrementing the matching slot's frequency. After a certain number of rounds (e.g. n*255 rounds, guaranteeing no overflow of the uint8_t counters, then accumulating them into uint16_t) the min/max limits of each slot are recalculated by linear interpolation. And it is of course possible to re-run the sequence based on an estimate of how much the new data has changed the percentile estimates.

The algorithm would be sensitive to evaluation order, which could be mitigated by random sampling and multiple passes.
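
A sketch of the per-item range comparison described above, in plain C (the name `range_counts` and the fixed boundary array are illustrative; the branchless compares are the part that maps onto SIMD packed compares):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count how many input bytes fall into each of 4 ranges defined by
 * three interior boundaries b[0] <= b[1] <= b[2].  The range index is
 * computed as a sum of compare results, with no branches, so a
 * compiler (or hand-written NEON) can process many items in parallel. */
void range_counts(const uint8_t *src, size_t n, const uint8_t b[3],
                  uint32_t counts[4])
{
    uint32_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (size_t i = 0; i < n; ++i) {
        uint8_t v = src[i];
        /* Sum of compares gives the range index without branching. */
        unsigned idx = (v >= b[0]) + (v >= b[1]) + (v >= b[2]);
        c0 += (idx == 0);
        c1 += (idx == 1);
        c2 += (idx == 2);
        c3 += (idx == 3);
    }
    counts[0] = c0; counts[1] = c1; counts[2] = c2; counts[3] = c3;
}
```

The percentile refinement step (re-interpolating the boundaries from the counts and re-running) is left out here; this only shows the parallel-compare accumulation.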

Aki Suihkonen