21

I have an array of bytes, in memory. What's the fastest way to see if all the bytes in the array are zero?

Claudiu
  • 224,032
  • 165
  • 485
  • 680

7 Answers7

30

Nowadays, short of using SIMD extensions (such as SSE on x86 processors), you might as well iterate over the array and compare each value to 0.

In the distant past, performing a comparison and conditional branch for each element in the array (in addition to the loop branch itself) would have been deemed expensive and, depending on how often (or early) you could expect a non-zero element to appear in the array, you might have elected to completely do without conditionals inside the loop, using solely bitwise-or to detect any set bits and deferring the actual check until after the loop completes:

int sum = 0;
for (i = 0; i < ARRAY_SIZE; ++i) {
  sum |= array[i];
}
if (sum != 0) {
  printf("At least one array element is non-zero\n");
}

However, with today's pipelined super-scalar processor designs complete with branch prediction, all non-SSE approaches are virtualy indistinguishable within a loop. If anything, comparing each element to zero and breaking out of the loop early (as soon as the first non-zero element is encountered) could be, in the long run, more efficient than the sum |= array[i] approach (which always traverses the entire array) unless, that is, you expect your array to be almost always made up exclusively of zeroes (in which case making the sum |= array[i] approach truly branchless by using GCC's -funroll-loops could give you the better numbers -- see the numbers below for an Athlon processor, results may vary with processor model and manufacturer.)

#include <stdio.h>

int a[1024*1024];

/* Methods 1 & 2 are equivalent on x86 */  

int main() {
  int i, j, n;

# if defined METHOD3
  int x;
# endif

  for (i = 0; i < 100; ++i) {
#   if defined METHOD3
    x = 0;
#   endif
    for (j = 0, n = 0; j < sizeof(a)/sizeof(a[0]); ++j) {
#     if defined METHOD1
      if (a[j] != 0) { n = 1; }
#     elif defined METHOD2
      n |= (a[j] != 0);
#     elif defined METHOD3
      x |= a[j];
#     endif
    }
#   if defined METHOD3
    n = (x != 0);
#   endif

    printf("%d\n", n);
  }
}

$ uname -mp
i686 athlon
$ gcc -g -O3 -DMETHOD1 test.c
$ time ./a.out
real    0m0.376s
user    0m0.373s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD2 test.c
$ time ./a.out
real    0m0.377s
user    0m0.372s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD3 test.c
$ time ./a.out
real    0m0.376s
user    0m0.373s
sys     0m0.003s

$ gcc -g -O3 -DMETHOD1 -funroll-loops test.c
$ time ./a.out
real    0m0.351s
user    0m0.348s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD2 -funroll-loops test.c
$ time ./a.out
real    0m0.343s
user    0m0.340s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD3 -funroll-loops test.c
$ time ./a.out
real    0m0.209s
user    0m0.206s
sys     0m0.003s
vladr
  • 65,483
  • 18
  • 129
  • 130
  • What's up with threads? Would it make even more faster? – Kobor42 Dec 03 '13 at 15:45
  • Threads are heavy to setup, won't worth it unless it's a very big array (cf http://stackoverflow.com/questions/3929774/how-much-overhead-is-there-when-creating-a-thread) – Taiki Apr 17 '14 at 11:45
  • not even mentionning the fact that if you didnt allocate your array in NUMA parts it will serialize access. if its in L3 though you have a chance. – v.oddou Mar 30 '15 at 06:29
12

Here's a short, quick solution, if you're okay with using inline assembly.

#include <stdio.h>

int main(void) {
    int checkzero(char *string, int length);
    char str1[] = "wow this is not zero!";
    char str2[] = {0, 0, 0, 0, 0, 0, 0, 0};
    printf("%d\n", checkzero(str1, sizeof(str1)));
    printf("%d\n", checkzero(str2, sizeof(str2)));
}

int checkzero(char *string, int length) {
    int is_zero;
    __asm__ (
        "cld\n"
        "xorb %%al, %%al\n"
        "repz scasb\n"
        : "=c" (is_zero)
        : "c" (length), "D" (string)
        : "eax", "cc"
    );
    return !is_zero;
}

In case you're unfamiliar with assembly, I'll explain what we do here: we store the length of the string in a register, and ask the processor to scan the string for a zero (we specify this by setting the lower 8 bits of the accumulator, namely %%al, to zero), reducing the value of said register on each iteration, until a non-zero byte is encountered. Now, if the string was all zeroes, the register, too, will be zero, since it was decremented length number of times. However, if a non-zero value was encountered, the "loop" that checked for zeroes terminated prematurely, and hence the register will not be zero. We then obtain the value of that register, and return its boolean negation.

Profiling this yielded the following results:

$ time or.exe

real    0m37.274s
user    0m0.015s
sys     0m0.000s


$ time scasb.exe

real    0m15.951s
user    0m0.000s
sys     0m0.046s

(Both test cases ran 100000 times on arrays of size 100000. The or.exe code comes from Vlad's answer. Function calls were eliminated in both cases.)

susmits
  • 2,210
  • 2
  • 23
  • 27
  • What if we take this bitmagic approach and combine with threads? Could you give this task to a threadpool? – Kobor42 Dec 03 '13 at 15:44
4

Split the checked memory half, and compare the first part to the second.
a. If any difference, it can't be all the same.
b. If no difference repeat for the first half.

Worst case 2*N. Memory efficient and memcmp based.
Not sure if it should be used in real life, but I liked the self-compare idea.
It works for odd length. Do you see why? :-)

bool memcheck(char* p, char chr, size_t size) {
    // Check if first char differs from expected.
    if (*p != chr) 
        return false;
    int near_half, far_half;
    while (size > 1) {
        near_half = size/2;
        far_half = size-near_half;
        if (memcmp(p, p+far_half, near_half))
            return false;
        size = far_half;
    }
    return true;
}
Kobor42
  • 5,129
  • 1
  • 17
  • 22
  • you should also check whether the first element is 0, otherwise it'll return true for anything where each byte is the same, won't it? – Claudiu Mar 29 '13 at 14:45
  • 1
    also it has `n + n/2 + n/4 + ...` operations which would just be `2n` at most, so it's still `O(n)` i think... – Claudiu Mar 29 '13 at 14:47
  • Sorry, had some edits. Now it's final. Clau, the first char is checked. "return *p == chr;". You are right about the O(N). – Kobor42 Mar 29 '13 at 14:53
  • ah i didn't see that, i was looking for a `'0'` literal but this checks if the array is all of any given char – Claudiu Mar 29 '13 at 15:37
  • Yeah, 'cause there is a memset with extra parameter to fill, so why not have that for memcheck too? Who uses memset for other than zero? The same may use memcheck for something else. :-) – Kobor42 May 21 '13 at 11:06
  • 3
    This algorithm compare every byte and does many out of order memory loads. As it is `O(2n-1)`=`O(n)+O(n/2)+O(n/4)+...`, something that just compares each byte (or words/dwords, etc) versus a register will be faster. Any algorithm will be memory constrained (for the positive case), so minimizing memory cycles will give the biggest gain. The `memcmp()` attempts to hide complexity; it itself is `O(n)` for memory accesses. – artless noise Apr 17 '14 at 15:12
  • On real world data this will end in O(n/2). But anyway, thanks for summarizing the discussion so far. :-) – Kobor42 Apr 18 '14 at 14:23
4

If you want to do this in 32-bit C, probably just loop over the array as a 32-bit integer array and compare it to 0, then make sure the stuff at the end is also 0.

WhirlWind
  • 13,974
  • 3
  • 42
  • 42
  • 1
    Note that this is *technically* platform dependent though I can't think of a platform where it wouldn't work. +1 – Billy ONeal Apr 07 '10 at 03:03
  • Billy - I agree, but I'm guessing it's ok, since it's tagged 32bit. – WhirlWind Apr 07 '10 at 03:04
  • 5
    In fact, just use a plain for loop on char and compile with `-funroll-loops` and the compiler will do the right thing for you. – J-16 SDiZ Apr 07 '10 at 03:06
  • @Billy ONeal: If "integer" means `int`, then it won't work on any platform that uses sign-magnitude integers, since the bit patterns for 0 and -0 can't *both* be all zeros, but they compare equal. So you get false positives. I can't name such a platform off the top of my head, though, and I don't really expect ever to use one. You can fix that particular issue by loading unsigned int, or perhaps better `uint32_t`, since that is not permitted to have padding bits. – Steve Jessop Apr 07 '10 at 10:34
  • 3
    @J-16: The question REQUIRED a fast version. As a professional game programmer who has spent man years in optimizing code, I can tell you that writing the code naively and using a compiler flag like "-funroll-loops" only generates optimal code about 1% of the time. Most of the time you have to help the compiler out. – Adisak Apr 08 '10 at 15:54
  • This answer assumes the length of the array is a multiple of the size of an integer and that the array is aligned. Enforcing alignment and padding isn't always an option. Sometimes one need to search for zeros in an array where you don't control the length and alignment, and where the bytes next to it on either end contains unrelated data. – kasperd Feb 22 '15 at 08:57
3

If the array is of any decent size, your limiting factor on a modern CPU is going to be access to the memory.

Make sure to use cache prefetching for a decent distance ahead (i.e. 1-2K) with something like __dcbt or prefetchnta (or prefetch0 if you are going to use the buffer again soon).

You will also want to do something like SIMD or SWAR to or multiple bytes at a time. Even with 32-bit words, it will be 4X less operations than a per character version. I'd recommend unrolling the or's and making them feed into a "tree" of or's. You can see what I mean in my code example - this takes advantage of superscalar capability to do two integer ops (the or's) in parallel by making use of ops that do not have as many intermediate data dependencies. I use a tree size of 8 (4x4, then 2x2, then 1x1) but you can expand that to a larger number depending on how many free registers you have in your CPU architecture.

The following pseudo-code example for the inner loop (no prolog/epilog) uses 32-bit ints but you could do 64/128-bit with MMX/SSE or whatever is available to you. This will be fairly fast if you have prefetched the block into the cache. Also you will possibly need to do unaligned check before if your buffer is not 4-byte aligned and after if your buffer (after alignment) is not a multiple of 32-bytes in length.

const UINT32 *pmem = ***aligned-buffer-pointer***;

UINT32 a0,a1,a2,a3;
while(bytesremain >= 32)
{
    // Compare an aligned "line" of 32-bytes
    a0 = pmem[0] | pmem[1];
    a1 = pmem[2] | pmem[3];
    a2 = pmem[4] | pmem[5];
    a3 = pmem[6] | pmem[7];
    a0 |= a1; a2 |= a3;
    pmem += 8;
    a0 |= a2;
    bytesremain -= 32;
    if(a0 != 0) break;
}

if(a0!=0) then ***buffer-is-not-all-zeros***

I would actually suggest encapsulating the compare of a "line" of values into a single function and then unrolling that a couple times with the cache prefetching.

Adisak
  • 6,708
  • 38
  • 46
2

Measured two implementations on ARM64, one using a loop with early return on false, one that ORs all bytes:

int is_empty1(unsigned char * buf, int size)
{
    int i;
    for(i = 0; i < size; i++) {
        if(buf[i] != 0) return 0;
    }
    return 1;
}

int is_empty2(unsigned char * buf, int size)
{
    int sum = 0;
    for(int i = 0; i < size; i++) {
        sum |= buf[i];
    }
    return sum == 0;
}

Results:

All results, in microseconds:

        is_empty1   is_empty2
MEDIAN  0.350       3.554
AVG     1.636       3.768

only false results:

        is_empty1   is_empty2
MEDIAN  0.003       3.560
AVG     0.382       3.777

only true results:

        is_empty1   is_empty2
MEDIAN  3.649       3,528
AVG     3.857       3.751

Summary: only for datasets where the probability of false results is very small, the second algorithm using ORing performs better, due to the omitted branch. Otherwise, returning early is clearly the outperforming strategy.

Ortwin Gentz
  • 52,648
  • 24
  • 135
  • 213
1

Rusty Russel's memeqzero is very fast. It reuses memcmp to do the heavy lifting: https://github.com/rustyrussell/ccan/blob/master/ccan/mem/mem.c#L92.

zbyszek
  • 5,105
  • 1
  • 27
  • 22