8

What's a speedy method to count the number of zero-valued bytes in a large, contiguous array? (Or conversely, the number of non-zero bytes.) By large, I mean 2^16 bytes or larger. The array's position and length may have any byte alignment.

Naive way:

int countZeroBytes(const unsigned char* values, int length)
{
    int zeroCount = 0;
    for (int i = 0; i < length; ++i)
        if (!values[i])
            ++zeroCount;

    return zeroCount;
}

For my problem, I usually just maintain zeroCount and update it based on specific changes to values. However, I'd like to have a fast, general method of re-computing zeroCount after an arbitrary bulk change to values occurs. I'm sure there's a bit-twiddly method of accomplishing this more quickly, but alas, I'm but a novice twiddler.
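The incremental bookkeeping described above might look like this (a sketch; `set_byte` is a hypothetical helper, not code I actually use):

```c
#include <stddef.h>

/* Hypothetical helper: keep zeroCount in sync on a single-byte write
   instead of rescanning the whole array. */
void set_byte(unsigned char *values, int i, unsigned char v, int *zeroCount)
{
    *zeroCount += (v == 0) - (values[i] == 0);  /* net change: +1, -1, or 0 */
    values[i] = v;
}
```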

EDIT: A few people have asked about the nature of the data being zero-checked, so I'll describe it. (It'd be nice if solutions were still general, though.)

Basically, envision a world composed of voxels (e.g. Minecraft), with procedurally generated terrain segregated into cubic chunks, effectively pages of memory indexed as three-dimensional arrays. Each voxel is fly-weighted as a unique byte corresponding to a unique material (air, stone, water, etc). Many chunks contain only air or water, while others contain varying combinations of 2-4 voxel types in large quantities (dirt, sand, etc), with effectively 2-10% of voxels being random outliers. Voxels existing in large quantities tend to be highly clustered along every axis.

It seems as though a zero-byte-counting method would be useful in a number of unrelated scenarios, though. Hence, the desire for a general solution.

Philip Guin
    Look for "population count" hardware instructions and compiler intrinsics (e.g. [here for MSVC](http://msdn.microsoft.com/en-us/library/bb385231(v=vs.90).aspx)), and apply them to the largest word size available. – Kerrek SB Jan 04 '14 at 22:51
    There may be ways to exploit specific architectures (e.g. specific opcodes, or SSE on x86), but in general, I doubt there's any way to do this faster. Even lookup tables (for e.g. 16-bit chunks) probably won't help, as they'll just blow the cache. – Oliver Charlesworth Jan 04 '14 at 22:51
    The optimal answer may depend on whether you expect zeros to be common. If not, a bitwise "has-zero" operation on a large (e.g. 64-bit) word at a time might be the best approach, allowing you to skip large runs with no zeros. Beware of aliasing violations, however... – R.. GitHub STOP HELPING ICE Jan 04 '14 at 23:22
  • Maybe some tricky implementation as for [strlen](http://www.stdlib.net/~colmmacc/strlen.c.html) may help... (iterate by `uint32_t*` instead of naive `uint8_t`). – Jarod42 Jan 04 '14 at 23:55
  • @R.. A few people have asked about the nature of the data, so I'll edit the question to include it. I wanted to keep the problem as short and general as possible, but that may be a bit too limiting. (This is an optimization problem, after all.) – Philip Guin Jan 05 '14 at 00:10
  • @Philip, I posted some code which uses SSE to speed up the calculation when the density of zeros is low. – Z boson Jan 05 '14 at 13:33
  • I updated the code to handle long runs of zeros as well now. For example if the first half of the array is non zero and the last half is zero (not likely but it shows it works) it's over 10x faster. – Z boson Jan 05 '14 at 13:53
  • Actually, it turns out SSE is much faster independent of the density it just turns out to be even faster if there are long runs of zeros. – Z boson Jan 05 '14 at 14:44
    Not posting this as an answer atm, but have you looked at the [relevant code in bit twiddling hacks](http://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord)? – Hasturkun Jan 05 '14 at 16:17
  • Maybe if I get some time I'll write this up into a formal answer. Until then, if you want to use SSE2 intrinsics, you can use _mm_cmpeq_epi8 to change each byte in a set of 16 bytes at a time into either 0xFF or 0x00; from there you can use _mm_movemask_epi8 to pull the top bit off every byte and put it into an int. If you go SSE4 you can use _mm_popcnt_u32 to count the bits, but if not you can do two look-ups into a table of size 256, one for each of the two bytes. Note this requires 16-byte alignment, but you can accomplish this by doing it the slow way before and after your aligned portion. – Apriori Jan 07 '14 at 02:52
  • If you're not going SSE, you can also use this trick to detect a zero: read in 8-byte-aligned 64-bit chunks (`unsigned __int64 x`), then compute `(x - 0x0101010101010101) & 0x8080808080808080 & ~x`, which evaluates to true if one of the 8 bytes is zero. But it will only detect whether there is a zero; you don't know how many without digging further, and you don't know which byte was zero. – Apriori Jan 07 '14 at 03:03
  • @Hasturkun I've actually been poring over that page, believe it or not. Unfortunately, it doesn't *quite* provide what I need, but it's darn close. I need both values ranged 0-255, as well as a precise count, not just existence testing. I'm sure it could be used as a fast preliminary test, which has been suggested below. – Philip Guin Jan 08 '14 at 07:19
  • @Philip: The linked page has a `countless` macro which does return a precise count (in this case called as `countless(value, 1)`). I've done some very basic testing and that does seem to beat the naive implementation at least. – Hasturkun Jan 08 '14 at 08:37
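The 64-bit has-zero test discussed in the comments above can be sketched as a portable SWAR routine (a sketch only; `count_zero_bytes_swar` is an illustrative name, not from the thread):

```c
#include <stdint.h>
#include <string.h>

/* The classic SWAR "has a zero byte" test quoted in the comments:
   true exactly when some byte of x is zero. */
static int has_zero_byte(uint64_t x)
{
    return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}

/* Illustrative counter built on it: skip 8-byte words containing no
   zeros, and fall back to a byte loop only when the test fires. */
int count_zero_bytes_swar(const unsigned char *p, size_t n)
{
    int count = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, 8);           /* alignment-safe load */
        if (!has_zero_byte(w))
            continue;                    /* fast path: no zeros in this word */
        for (size_t j = 0; j < 8; ++j)
            count += p[i + j] == 0;
    }
    for (; i < n; ++i)                   /* leftover tail bytes */
        count += p[i] == 0;
    return count;
}
```

As R.. notes above, the word-at-a-time skip only pays off when zeros are rare; when zeros are common, the inner byte loop runs on nearly every word.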

6 Answers

7

This is a special case of How to count character occurrences using SIMD with c=0, the char (byte) value to count matches for. See that Q&A for a well-optimized manually-vectorized AVX2 implementation of `char_count(char const* vector, size_t size, char c)`, with a much tighter inner loop than this, avoiding reducing each vector of 0/-1 matches to scalar separately.


This will go as O(n), so the best you can do is decrease the constant factor. One quick fix is to remove the branch. This gives a result as fast as my SSE version below if the zeros are randomly distributed, likely because GCC vectorizes this loop. However, for long runs of zeros, or for a random density of zeros less than 1%, the SSE version below is still faster.

int countZeroBytes_fix(char* values, int length) {
    int zeroCount = 0;
    for(int i=0; i<length; i++) {
        zeroCount += values[i] == 0;
    }
    return zeroCount;
}

I originally thought that the density of zeros would matter. That turns out not to be the case, at least with SSE. Using SSE is a lot faster independent of the density.

Edit: actually, it does depend on the density; the density of zeros just has to be smaller than I expected. With 1/64 zeros (1.5%), there is on average one zero per four SSE registers, so branch prediction does not work very well. With 1/1024 zeros (0.1%), however, it is faster (see the table of times).

SIMD is even faster if the data has long runs of zeros.

You can pack 16 bytes into an SSE register. Then you can compare all 16 bytes at once with zero using _mm_cmpeq_epi8. To handle runs of zeros you can then use _mm_movemask_epi8 on the result, and most of the time it will be zero. You could get a speedup of up to 16x in this case (for the first half ones and second half zeros I got over a 12x speedup).

Here is a table of times in seconds for 2^16 bytes (with a repeat of 10000).

                     1.5% zeros  50% zeros  0.1% zeros 1st half 1, 2nd half 0
countZeroBytes       0.8s        0.8s       0.8s        0.95s
countZeroBytes_fix   0.16s       0.16s      0.16s       0.16s
countZeroBytes_SSE   0.2s        0.15s      0.10s       0.07s

You can see the results for the last-half-zeros case at http://coliru.stacked-crooked.com/a/67a169ddb03d907a

#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>                 // SSE2
#include <omp.h>

int countZeroBytes(char* values, int length) {
    int zeroCount = 0;
    for(int i=0; i<length; i++) {
        if (!values[i])
            ++zeroCount;
    }
    return zeroCount;
}

int countZeroBytes_SSE(char* values, int length) {
    int zeroCount = 0;
    __m128i zero16 = _mm_set1_epi8(0);
    __m128i and16 = _mm_set1_epi8(1);
    for(int i=0; i<length; i+=16) {
        __m128i values16 = _mm_loadu_si128((__m128i*)&values[i]);
        __m128i cmp = _mm_cmpeq_epi8(values16, zero16);
        int mask = _mm_movemask_epi8(cmp);
        if(mask) {
            if(mask == 0xffff) zeroCount += 16;
            else {
            cmp = _mm_and_si128(and16, cmp); //change -1 values to 1
                //horizontal sum of 16 bytes
                __m128i sum1 = _mm_sad_epu8(cmp,zero16);
                __m128i sum2 = _mm_shuffle_epi32(sum1,2);
                __m128i sum3 = _mm_add_epi16(sum1,sum2);
                zeroCount += _mm_cvtsi128_si32(sum3);
            }
        }
    }
    return zeroCount;
}

int main() {
    const int n = 1<<16;
    const int repeat = 10000;
    char *values = (char*)_mm_malloc(n, 16);
    for(int i=0; i<n; i++) values[i] = rand()%64;  //1.5% zeros
    //for(int i=0; i<n/2; i++) values[i] = 1;
    //for(int i=n/2; i<n; i++) values[i] = 0;
    
    int zeroCount = 0;
    double dtime;
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) zeroCount = countZeroBytes(values,n);
    dtime = omp_get_wtime() - dtime;
    printf("zeroCount %d, time %f\n", zeroCount, dtime);
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) zeroCount = countZeroBytes_SSE(values,n);
    dtime = omp_get_wtime() - dtime;
    printf("zeroCount %d, time %f\n", zeroCount, dtime);       
    _mm_free(values);
}
Peter Cordes
Z boson
    You only need to horizontal sum (with SAD) every 127 iterations or something, to avoid overflow. Until then, you can accumulate counts with PADDB on the compare results, treating it as a vector of `-1` or `0`. That's more efficient than this even in the all-zeros or all-non-zeros cases. Even keeping it simple and hsumming to a vector of 64-bit counters every iteration would be pretty good. (i.e. PCMPEQB / PSADBW / PADDQ inside the loop, plus a MOVDQA or two). – Peter Cordes Oct 23 '16 at 00:24
5

I've come up with this OpenMP implementation, which may take advantage of the array already being in each processor's local cache to read it in parallel.

nzeros_total = 0;
#pragma omp parallel for reduction(+:nzeros_total)
    for (i=0;i<NDATA;i++)
    {
        if (v[i]==0)
            nzeros_total++;
    }

A quick benchmark, consisting of running a naive for loop (the same one the OP wrote in the question) 1000 times versus the OpenMP implementation, also 1000 times, taking the best time for both methods, with an array of 65536 ints and a zero-value element probability of 50%, using Windows 7 on a quad-core CPU, compiled with VStudio 2012 Ultimate, yields these numbers:

               DEBUG               RELEASE
Naive method:  580 microseconds.   341 microseconds.
OpenMP method: 159 microseconds.    99 microseconds.

NOTE: I've tried `#pragma loop(hint_parallel(4))` but apparently this didn't cause the naive version to perform any better, so my guess is that the compiler was already applying this optimization, or it couldn't be applied at all. Also, `#pragma loop(no_vector)` didn't cause the naive version to perform worse.

mcleod_ideafix
  • The "in all caches" assumption is a big one, but if it holds, this is a simple but effective technique. – Oliver Charlesworth Jan 05 '14 at 02:05
    This would probably be even faster if you remove the branch and just do `nzeros_total += v[i] == 0`. – Z boson Jan 05 '14 at 15:50
  • While this seems effective, I unfortunately don't have an extra processor to spare at the moment. Thank you though. – Philip Guin Jan 06 '14 at 00:08
  • I've marked this as the best answer since it appears to be both portable and efficient. However, if someone were to invent something portable, fast, non-trivial, and non-parallel, I'd be happy to accept it. (In addition to being amazed!) – Philip Guin Jan 09 '14 at 04:43
2

You can also use the POPCNT instruction, which returns the number of bits set. This allows you to simplify the code further and speed it up by eliminating unnecessary branches. Here is an example with AVX2 and POPCNT:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include "immintrin.h"

int countZeroes(uint8_t* bytes, int length)
{
    const __m256i vZero = _mm256_setzero_si256();
    int count = 0;
    for (int n = 0; n < length; n += 32)
    {
        __m256i v = _mm256_load_si256((const __m256i*)&bytes[n]);
        v = _mm256_cmpeq_epi8(v, vZero);
        int k = _mm256_movemask_epi8(v);
        count += _mm_popcnt_u32(k);
    }
    return count;
}

#define SIZE 1024

int main()
{
    uint8_t bytes[SIZE] __attribute__((aligned(32)));

    for (int z = 0; z < SIZE; ++z)
        bytes[z] = z % 2;

    int n = countZeroes(bytes, SIZE);
    printf("%d\n", n);

    return 0;
}
Daniel Frużyński
    You can tighten up the inner loop by using the cmpeq_epi8 results as 0 / -1 integers, summing into vector accumulators and only adding up at the end: [How to count character occurrences using SIMD](https://stackoverflow.com/q/54541129). Less work in the inner loop than movemask + popcnt + scalar-add, at the cost of needing nested loops to avoid overflow with large `length`. – Peter Cordes Sep 05 '21 at 00:18
1

For situations where zeros are common, it would be faster to check 64 bytes at a time and only examine individual bytes if the span is non-zero. If zeros are rare, this will be more expensive. This code assumes that the array length is divisible by 64. It also assumes that memcmp is as efficient as you can get.

int countZeroBytes(const unsigned char* values, int length)
{
    static const unsigned char zeros[64] = {};

    int zeroCount = 0;
    for (int i = 0; i < length; i+=64)
    {
        if (::memcmp(values+i, zeros, 64) == 0)
        {
             zeroCount += 64;
        }
        else
        {
               for (int j=i; j < i+64; ++j)
               {
                     if (!values[j])
                     {
                          ++zeroCount;
                     }
               }
        }
    }

    return zeroCount;
}
Glenn Teitelbaum
  • Hmm, well sometimes data is entirely zero, other times it's rare. What I like about this though is that I could use it if the previous `zeroCount` was high, or even swap between this approach and another at fixed intervals if `zeroCount` rises above a certain threshold. – Philip Guin Jan 06 '14 at 00:15
1

Brute force to count zero bytes: use a vector compare instruction, which sets each byte of a vector to all-ones (-1) if that byte was 0, and to 0 if it was not; mask the result with 1 so each matching byte counts as 1.

Do this up to 255 times to process up to 255 x 64 bytes (if you have 512-bit instructions available; 255 x 32 or 255 x 16 bytes with 256-bit or 128-bit vectors), and then just add up the 255 result vectors. Since each byte after the compare-and-mask has a value of 0 or 1, each per-byte sum is at most 255, so you now have one vector of 64 / 32 / 16 byte counters, down from about 16,000 / 8,000 / 4,000 bytes.
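A sketch of this accumulation scheme with 128-bit SSE2 intrinsics (assuming `length` is a multiple of 16; the function name is illustrative, not from the answer):

```c
#include <emmintrin.h>  /* SSE2 */

/* Accumulate 0/-1 compare results per byte; flush with PSADBW before
   any byte counter can exceed 255. Assumes length % 16 == 0. */
int count_zero_bytes_accum(const char *values, int length)
{
    const __m128i zero  = _mm_setzero_si128();
    __m128i       total = _mm_setzero_si128(); /* two 64-bit partial sums */
    int i = 0;
    while (i < length) {
        __m128i acc = _mm_setzero_si128();     /* 16 per-byte counters */
        int block_end = i + 255 * 16;
        if (block_end > length) block_end = length;
        for (; i < block_end; i += 16) {
            __m128i v   = _mm_loadu_si128((const __m128i *)&values[i]);
            __m128i cmp = _mm_cmpeq_epi8(v, zero); /* 0x00 or 0xFF per byte */
            acc = _mm_sub_epi8(acc, cmp);          /* subtracting -1 adds 1 */
        }
        /* horizontal sum of the 16 byte counters into two 64-bit lanes */
        total = _mm_add_epi64(total, _mm_sad_epu8(acc, zero));
    }
    return _mm_cvtsi128_si32(total) +
           _mm_cvtsi128_si32(_mm_srli_si128(total, 8));
}
```

Subtracting the -1 compare results (rather than masking and adding) saves the AND in the inner loop; this is the same idea Peter Cordes describes in the comments on the SSE answer above.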

gnasher729
0

It may be faster to avoid the condition and trade it for a look-up and an add:

char isCharZeroLUT[256] = { 1 }; /* 1 0 0 ... */
int zeroCount = 0;
for (int i = 0; i < length; ++i) {
    zeroCount += isCharZeroLUT[values[i]];
}

I haven't measured the difference, though. It is also worth noting that some compilers happily vectorize sufficiently simple loops.

user3146587
Dietmar Kühl
  • @Jongware: It's possible (likely?) that the compiler is already doing something like this, i.e. avoiding a conditional branch. – Oliver Charlesworth Jan 04 '14 at 22:58
  • @Jongware: possible. I'm not sure if that can be done by some bit manipulation or uses a condition inside. It is quite possible that it avoids both the condition and the look-up. – Dietmar Kühl Jan 04 '14 at 22:59
    If your C implementation has the usual definition of the (implementation-defined) right-shift for negative values, `count -= ((unsigned char)values[i]-1)>>8;` will do it. – R.. GitHub STOP HELPING ICE Jan 04 '14 at 23:20
  • @Jongware your code will increase the count when `values[i] != 0`, so I think `zeroCount += (values[i] == 0)` is more correct – phuclv Jan 05 '14 at 15:01
  • @LưuVĩnhPhúc: The OP considers "counting *non* zeroes" an equally good solution. The expression I proposed was to avoid explicit comparisons (I am aware one still may occur at assembly level). – Jongware Jan 05 '14 at 15:08
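The shift trick from R..'s comment above can be sketched as a branchless loop (it relies on implementation-defined arithmetic right shift of negative values, as the comment notes; the function name is illustrative):

```c
/* Branchless count using the comment's shift trick: (v - 1) >> 8 is
   -1 only when v == 0, assuming arithmetic right shift of negatives
   (true on the usual implementations, but implementation-defined). */
int count_zeros_shift(const unsigned char *values, int length)
{
    int count = 0;
    for (int i = 0; i < length; ++i)
        count -= ((int)values[i] - 1) >> 8;
    return count;
}
```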