24

I have a large block of data, maybe 4 MB, and I want to check whether all bits in it are 0.

E.g., here is the data:

void* data = malloc(4*1024*1024);
memset(data, 0, 4*1024*1024);

I need to check whether all bits in it are 0. Here is my solution, which is not fast enough:

int dataisnull(char* data, int length)
{
    int i = 0;
    while(i<length){
        if (data[i]) return 0;
        i++;
    }
    return 1;
}

This code leaves some room for performance improvement. For example, on a 32- or 64-bit machine, checking 4 or 8 bytes at a time may be faster.

So I wonder what is the fastest way to do it?
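For illustration, the 4/8-bytes-at-a-time idea might look like the following sketch (dataisnull64 is just an example name; memcpy is used to sidestep alignment and strict-aliasing issues, and compilers typically turn it into a single load):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read 8 bytes at a time through memcpy, then check the remaining tail bytes one by one. */
int dataisnull64(const char *data, size_t length)
{
    size_t i = 0;
    for (; i + 8 <= length; i += 8) {
        uint64_t w;
        memcpy(&w, data + i, 8);   /* usually compiled to a single 64-bit load */
        if (w) return 0;
    }
    for (; i < length; i++)
        if (data[i]) return 0;
    return 1;
}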

Yk Cheese
  • 120
  • 2
  • 9
Ringo_D
  • 784
  • 5
  • 18
  • 8
    Don't you need to increment `data` in the loop too? – Andy Turner Feb 17 '16 at 07:20
  • 2
    Use something like SIMD to process the data faster. – Dai Feb 17 '16 at 07:22
  • Why are you using `void*` instead of `char*` for your data if the data is chars? – Magisch Feb 17 '16 at 07:23
  • 2
    Also, don't you need to do `if(data[i]) return 0;` ? – Magisch Feb 17 '16 at 07:24
  • 10
    if you want to zero data at allocation time then `calloc` is much faster than `malloc` then zero – phuclv Feb 17 '16 at 07:53
  • 1
    IMHO, this is one of those cases where inline assembly is probably your best bet. – Bathsheba Feb 17 '16 at 08:04
  • Which compiler do you use? What are the optimization flags? The compiler may do a better job at optimizing than you will, without the expense of obfuscating the code (in production, it's undesirable to have "magic code" that no one can debug). – orion Feb 17 '16 at 08:32
  • @AndyTurner Sorry, I missed it. Added now. – Ringo_D Feb 17 '16 at 09:20
  • @LưuVĩnhPhúc I learned a lot! – Ringo_D Feb 17 '16 at 09:30
  • 4
    Possible duplicate of [Faster approach to checking for an all-zero buffer in C?](http://stackoverflow.com/questions/1493936/faster-approach-to-checking-for-an-all-zero-buffer-in-c), although that question doesn't have any great answers :/ Also found [Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?](http://stackoverflow.com/questions/15172102/using-c-intel-assembly-what-is-the-fastest-way-to-test-if-a-128-byte-memory-blo) – Peter Cordes Feb 17 '16 at 11:48
  • Also, took me forever to find, but I knew I'd seen this recently: [How to get gcc to generate decent code that checks if a buffer is full of NUL bytes?](http://stackoverflow.com/questions/35132492/how-to-get-gcc-to-generate-decent-code-that-checks-if-a-buffer-is-full-of-nul-by). This question itself is a decent answer to the others. It mentions the `int foo(const char usth[static 512])` syntax to try to convince the compiler that it's ok to auto-vectorize in a way that reads data the scalar code wouldn't have. (but which unfortunately gcc and clang don't seem to take advantage of). – Peter Cordes Feb 17 '16 at 12:13
  • @SouravGhosh: I believe the memset is just some example test harness code to create the array, and that his real data will not necessarily be all zeros, so it will need checking. – Oddthinking Feb 17 '16 at 14:10
  • @LưuVĩnhPhúc *if you want to zero data at allocation time then `calloc` is much faster than `malloc` then zero* One thing to be aware of, though, is `calloc` can provide zero'd *virtual* memory without actually creating the physical mapping. If you actually need the physical mapping for some reason (memory to store data read from a *fast* device, for example), you may actually need the pages to be mapped and actually zero the memory to force the mapping to be created. See http://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc – Andrew Henle Feb 17 '16 at 15:11
  • 1
    @AndrewHenle: That's true if burning CPU time between allocation and the first `read` is ok. Otherwise, you might as well let the first `read` trigger the copy-on-write allocation of physical pages to back the mapping. Subsequent writes to the buffer (including further `read(2)` system calls) will be just as fast either way. You could also use `madvise(MADV_HUGEPAGE)` to get the memory wired, and as a bonus, have it use transparent hugepages. Interesting reading for the opposite problem (turning dirty pages back into copy-on-write zero pages): http://stackoverflow.com/q/21722545/224132 – Peter Cordes Feb 17 '16 at 19:32
  • 1
    @Andrew: there's also `mlock`, or `mmap(MAP_LOCKED)` for getting already-wired zeroed pages that won't pagefault when you first touch them. You'd probably want to combine this with `MAP_HUGETLB`, since hugepages can't be swapped out (and I don't think they start COW-mapped to a single zero page). – Peter Cordes Feb 17 '16 at 19:36
  • @PeterCordes *That's true if burning CPU time between allocation and the first read is ok. Otherwise, you might as well let the first read trigger the copy-on-write allocation of physical pages to back the mapping* Usually true. I was merely pointing out that `calloc()` doesn't necessarily create physical mappings despite zeroing out the memory. That can matter if the device being read can deliver data faster than the OS can map the pages. Thanks for pointing out other options for getting already-mapped memory. And for mentioning `MAP_HUGETLB`, another high-performance tool too often ignored. – Andrew Henle Feb 17 '16 at 20:01
  • I've tried asking Glibc to implement an [API for this](https://sourceware.org/bugzilla/show_bug.cgi?id=21018) but it got rejected. – yugr Jan 23 '17 at 08:27

3 Answers

14

You can handle multiple bytes at a time and unroll the loop:

#include <stddef.h>   /* size_t */

int dataisnull(const void *data, size_t length) {
    /* assuming data was returned by malloc, thus is properly aligned */
    size_t i = 0, n = length / sizeof(size_t);
    const size_t *pw = data;
    const unsigned char *pb = data;
    size_t val;
#define UNROLL_FACTOR  8
#if UNROLL_FACTOR == 8
    size_t n1 = n - n % UNROLL_FACTOR;
    for (; i < n1; i += UNROLL_FACTOR) {
        val = pw[i + 0] | pw[i + 1] | pw[i + 2] | pw[i + 3] |
              pw[i + 4] | pw[i + 5] | pw[i + 6] | pw[i + 7];
        if (val)
            return 0;
    }
#endif
    val = 0;
    for (; i < n; i++) {
        val |= pw[i];
    }
    for (i = n * sizeof(size_t); i < length; i++) {
        val |= pb[i];
    }
    return val == 0;
}

Depending on your specific problem, it might be more efficient to detect non-zero values early or late:

  • If the all-zero case is the most common, you should accumulate all bits into the `val` accumulator and only test it at the end.
  • If the all-zero case is rare, you should check for non-zero values more often.

The unrolled version above is a compromise that tests for non-zero values every 32 or 64 bytes (UNROLL_FACTOR words), depending on the size of size_t.

Depending on your compiler and processor, you might get better performance by unrolling less or more. You could also use intrinsic functions available for your particular architecture to take advantage of vector types, but it would be less portable.
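For example, a minimal SSE2 sketch of the intrinsics approach (an illustration only, assuming an x86 target, a 16-byte-aligned pointer such as one returned by malloc, and a length that is a multiple of 16) might look like this:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* OR all 16-byte chunks into one accumulator, then test it once at the end. */
static int dataisnull_sse2(const unsigned char *data, size_t length)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < length; i += 16)
        acc = _mm_or_si128(acc, _mm_load_si128((const __m128i *)(data + i)));
    /* acc is all zero iff every one of its bytes compares equal to zero (mask == 0xFFFF) */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(acc, _mm_setzero_si128())) == 0xFFFF;
}

Checking the accumulator every few iterations instead of only at the end would restore the early exit for buffers that are rarely all zero.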

Note that the unrolled function does not verify proper alignment of the data pointer:

  • such a check cannot be done portably.
  • the code assumes the data was allocated via malloc or similar, and is hence properly aligned for any type.

As always, benchmark different solutions to see if they make a real difference. This function might not be a bottleneck at all; writing a complex function to optimize a rare case is counterproductive: it makes the code less readable, more likely to contain bugs, and much less maintainable. For example, the assumption about data alignment may not hold if you change the memory allocation scheme or use static arrays; the function may then invoke undefined behavior.

chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • Could you use something like [Duff's device](https://en.wikipedia.org/wiki/Duff%27s_device) to parameterize the unroll factor? – MooseBoys Feb 17 '16 at 08:11
  • 3
    That code will invoke Undefined Behavior in C99, C11, or C14 if the data in question has an effective type which is something other than a size_t or a signed equivalent. Note that for purpose of C99's aliasing rules, `int` and `long` are different types; even if both are the same size as `size_t`, at most one of them will be an equivalent type for purpose of aliasing rules. Your approach would be a good one in most sensible dialects of C, but not in C99, C11, or C14 as defined by the Standards. – supercat Feb 17 '16 at 08:22
  • 1
    Compilers are quite complex and unpredictable. I'm concerned that because the compiler can't verify the alignment of the incoming pointer at compile-time, it will produce expensive instructions (a lot of fast instructions only work on aligned addresses). Also, peeking into the assembler at different levels of optimizations may reveal the compiler is already attempting to unroll the loops and group the branches (speculative execution). – orion Feb 17 '16 at 08:31
  • 4
    Is the loop unrolling actually providing a speed up? It looks like something the compiler should be quite good at. GCC has `-funroll-loops` to do this automatically. – Davidmh Feb 17 '16 at 08:59
  • Thanks for your answer. I have seen this kind of code before, but why is it faster? Does the compiler do something magical with the "|" operator? – Ringo_D Feb 17 '16 at 09:26
  • @supercat What is C14? The new C standard? Even wikipedia doesn't have an article for it. – nalzok Feb 17 '16 at 09:56
  • 1
    @Ringo_D: it should be faster because there are fewer tests and branches: fewer comparisons with `0` and fewer bounds checks, thanks to reading multiple bytes at a time and unrolling the loop. – chqrlie Feb 17 '16 at 10:40
  • 1
    @supercat: I confirm that the above code might not work as expected on a DS9K: I couldn't even try it there because its central unit has not yet recovered from the last UB, daemons are still flying in and out of it. – chqrlie Feb 17 '16 at 10:48
  • @Davidmh: only careful benchmarking can answer this question. Manual unrolling should still improve performance if the compiler does not unroll the loop automatically. – chqrlie Feb 17 '16 at 10:50
  • gcc 5.3 `-O3` does some of [the most perverse auto-vectorization I've ever seen on this function](http://goo.gl/1fNS4n). I think it doesn't realize that the cleanup loop is just a cleanup loop and not worth vectorizing. In any case, since you do `val |= a byte`, it unpacks the bytes to 64bit vector elements with a chain of `vextracti128`, 2x `vpmovzxbw`, 4x `vpmovzxwd`, and 8x `vpmovzxdq` before ORing them together! I haven't even found the main loop yet, it's probably tiny. – Peter Cordes Feb 17 '16 at 11:28
  • Also, why are you doing anything with `CHAR_BIT`? I don't follow the logic. `sizeof`'s result is in units of `char`s, not bits. Other than that, ORing some chunks together to improve branch-predictor patterns is probably a good idea. Like maybe a cache-line at a time of data (64bytes on all modern x86 CPUs). Your unroll seems excessive; 8 should be plenty. gcc likes to go berserk setting up aligned pointers for auto-vectorized loops, which leads to really bloated code. – Peter Cordes Feb 17 '16 at 11:34
  • 1
    @PeterCordes The main loop is just between lines 18 and 40. – orion Feb 17 '16 at 13:17
  • @PeterCordes: you are absolutely right. I fixed the `CHAR_BIT` mess and downsized the unrolling. – chqrlie Feb 17 '16 at 16:27
  • Here's [an attempt to get gcc to generate less-horrible code](http://goo.gl/uNxSW5). Using a byte accumulator for the byte-loop helps avoid the perverse unpacking when it auto-vectorizes. Skipping the word-cleanup loop is good, since gcc is going to auto-vectorize the byte loop anyway. The code is not bad with AVX2, but nasty with AVX or with just SSE. – Peter Cordes Feb 17 '16 at 19:45
  • @PeterCordes: it is amazing how difficult it has become to make gcc or clang generate readable assembly code nowadays. Bloat is everywhere, for questionable performance gains. – chqrlie Feb 17 '16 at 21:36
  • @chqrlie: on one missed-optimization bug I reported a while ago, one of the gcc devs said they should probably have gcc just use unaligned loads/stores for x86 in a lot of cases, instead of peeling off so many loop iterations to set up an aligned vector loop. That's faster in the common case where the buffer *is* aligned, despite the lack of guarantee. Fast hardware support for unaligned loads/stores makes it pretty competitive with scalar startup/cleanup loops (esp. the clunky ones gcc generates) for non-gigantic buffer sizes. – Peter Cordes Feb 17 '16 at 21:53
  • 1
    And yeah, locally-optimal code that will win a micro-benchmark (with hot caches and primed branch prediction) is not the same thing as efficient code for part of a larger program. Instruction cache, and especially uop-cache, are valuable resources that compilers tend to squander. I haven't done a lot of looking at profile-directed optimization output, but hopefully it only bloats functions that are actually hot with `-O3`, otherwise using `-O3` for programs that don't have just one hot loop is pretty questionable. – Peter Cordes Feb 17 '16 at 21:55
  • [gcc bug report submitted](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69908) about how it's apparently impossible to write a function for this that compiles nicely, and about the perverse auto-vectorization for this specific attempt. – Peter Cordes Feb 22 '16 at 22:49
5

The following checks that the first byte has the value you want and that each byte equals the one after it, by comparing the buffer against itself shifted by one byte; by induction, every byte must then equal val.

#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uintptr_t */
#include <string.h>   /* memcmp */

int check_bytes(const char * const data, size_t length, const char val)
{
    if(length == 0) return 1;
    if(*data != val) return 0;
    return memcmp(data, data+1, length-1) ? 0 : 1;
}

int check_bytes64(const char * const data, size_t length, const char val)
{
    /* round data up and data + length down to 64-byte (cache line) boundaries */
    const char * const aligned64_start = (char *)((((uintptr_t)data) + 63) / 64 * 64);
    const char * const aligned64_end = (char *)((((uintptr_t)data) + length) / 64 * 64);
    const size_t start_length = aligned64_start - data;
    const size_t aligned64_length = aligned64_end - aligned64_start;
    const size_t end_length = length - start_length - aligned64_length;

    if (!check_bytes(data, start_length, val)) return 0;
    if (!check_bytes(aligned64_end, end_length, val)) return 0;

    /* this only proves that the aligned middle region repeats with a 64-byte
       period; its first block is not compared against val here (see the note below) */
    return memcmp(aligned64_start, aligned64_start + 64, aligned64_length-64) ? 0 : 1;
}

A more elaborate version of this function should probably pass cache-line-aligned pointers to memcmp and manually check the remaining block(s) instead of just the first byte, as sketched below.
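For instance, a simplified sketch of that idea (my illustration, not the answer's code; it assumes data is already 64-byte aligned and length is a non-zero multiple of 64) might be:

#include <stddef.h>
#include <string.h>

/* Check the first cache line directly, then require every later byte to equal
   the byte 64 positions before it via a single overlapping memcmp. */
static int check_bytes_aligned64(const char *data, size_t length, char val)
{
    for (size_t i = 0; i < 64; i++)
        if (data[i] != val) return 0;
    return memcmp(data, data + 64, length - 64) == 0;
}

If the first 64 bytes all equal val and every byte equals the byte 64 positions before it, then by induction the whole buffer equals val.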

Of course, you will have to profile on your specific hardware to see if there is any speed benefit of this method vs others.

If anyone doubts whether this works, there is a demo on ideone.

Nicu Stiurca
  • 8,747
  • 8
  • 40
  • 48
  • I also don't know the reason for the downvote. The idea seems really smart :) and I personally like it, so one upvote from me. Perhaps some insignificant type cast is required, but that is only a cosmetic improvement. – dmi Feb 17 '16 at 07:30
  • 4
    This solution is not really efficient: `memcmp` is given unaligned pointers, so it cannot use optimized code that compares multiple bytes at a time efficiently. Furthermore, comparing bytes from memory is less efficient than comparing bytes to a constant value. I did not downvote, but I understand those who did. – chqrlie Feb 17 '16 at 07:53
  • 1
    @chqrlie not necessarily, it depends on the implementation of memcmp. Some of them check if pointers are aligned and change the algorithm accordingly. – Jabberwocky Feb 17 '16 at 07:59
  • @chqrlie, in my opinion memcmp should be much better optimized for a particular platform than the solution given by the topic starter. Besides that, the question refers to a memory block allocated by malloc, so memcmp is legitimate to use with respect to memory alignment. Thus I find the answer still very good. Perhaps the applicability of the proposed solution to a particular task should be considered, but that is IMHO off topic. – dmi Feb 17 '16 at 08:01
  • 5
    @MichaelWalz I think his point was that because of `data, data+1`, both pointers *cannot* be aligned. – user694733 Feb 17 '16 at 08:06
  • @chqrlie: The function should be written to allow both pointers to be aligned, but other than that the performance of any approach not using compiler intrinsics will be limited by the fact that even on platforms which have unsigned integer types with no padding, the Standard forbids the use of any such type larger than "unsigned char" to read data written as other types even when pointers are perfectly aligned. – supercat Feb 17 '16 at 08:26
  • 1
    @supercat Strict aliasing rules (or any other language rules) don't apply to implementation of C standard library functions. As long as `memcmp` behaves as standard says, code inside can use any types it wishes. Functions don't even have to be written in C. – user694733 Feb 17 '16 at 09:14
  • @MichaelWalz: that's exactly what I am hinting at, the naive `check_bytes` calls `memcmp` with pointers only 1 byte apart, defeating any such attempt for alignment based optimizations. – chqrlie Feb 17 '16 at 10:54
  • @user694733: You noted that an approach comparing bytes in memory to other bytes in memory will be less efficient than comparisons to a constant. While that should be the case, my point was that since there are no compiler intrinsics to scan for the first thing that doesn't match a constant, and since code which doesn't use compiler intrinsics is (IMHO annoyingly and stupidly) limited to processing one byte at a time, memcmp may, even with such overhead, come out ahead of anything that could be written without the use of compiler intrinsics. – supercat Feb 17 '16 at 17:00
4

I once wrote the following function for my own use. It assumes that the size of the data to check is a multiple of a constant chunk size and that the buffer is properly aligned for machine words. If this is not the case for you, it is not hard to handle the first and last few bytes individually in a loop and only check the bulk with the optimized function (a sketch of such a wrapper follows the code below). (Strictly speaking, it is undefined behavior if the data was written as any type that is incompatible with unsigned long, even if the array is properly aligned. However, I believe you can get pretty far with this careful breaking of the rules.)

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool
is_all_zero_bulk(const void *const p, const size_t n)
{
  typedef unsigned long word_type;
  const size_t word_size = sizeof(word_type);
  const size_t chunksize = 8;
  assert(n % (chunksize * word_size) == 0);
  assert((((uintptr_t) p) & 0x0f) == 0);
  const word_type *const frst = (word_type *) p;
  const word_type *const last = frst + n / word_size;
  for (const word_type * iter = frst; iter != last; iter += chunksize)
    {
      word_type acc = 0;
      // Trust the compiler to unroll this loop at its own discretion.
      for (size_t j = 0; j < chunksize; ++j)
        acc |= iter[j];
      if (acc != 0)
        return false;
    }
  return true;
}
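As suggested above, a small wrapper (a sketch of my own, not part of the original answer) can deal with arbitrary alignment and length by checking the head and tail byte by byte and passing only a 64-byte-aligned middle region, whose size is a multiple of 64, to is_all_zero_bulk():

static bool is_all_zero(const void *p, size_t n)
{
  const unsigned char *b = p;
  /* bytes needed to reach the next 64-byte boundary */
  size_t head = (64 - ((uintptr_t) b % 64)) % 64;
  if (head > n)
    head = n;
  /* largest multiple of 64 that fits after the head */
  const size_t bulk = ((n - head) / 64) * 64;

  for (size_t i = 0; i < head; ++i)
    if (b[i]) return false;
  if (bulk != 0 && !is_all_zero_bulk(b + head, bulk))
    return false;
  for (size_t i = head + bulk; i < n; ++i)
    if (b[i]) return false;
  return true;
}

The 64-byte split satisfies both asserts in is_all_zero_bulk on common 32- and 64-bit targets: 64 is a multiple of 8 * sizeof(unsigned long), and 64-byte alignment implies 16-byte alignment.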

The bulk function itself is not very smart. The main ideas are:

  • Use large unsigned machine words for data comparison.
  • Enable loop unrolling by factoring out an inner loop with a constant iteration count.
  • Reduce the number of branches by ORing the words into an accumulator and only comparing it every few iterations against zero.
  • This should also make it easy for the compiler to generate vectorized code using SIMD instructions which you really want for code like this.

Additional non-standard tweaks would be to annotate the function with __attribute__ ((hot)) and use __builtin_expect(acc != 0, false). Of course, the most important thing is to turn on your compiler's optimizations.

5gon12eder
  • 24,280
  • 5
  • 45
  • 92
  • 3
    Note that in C99, C11, and C14 the function will invoke Undefined Behavior if the data was written as any type other than word_type or a signed equivalent thereof. Maybe someday the Standards Committee will acknowledge that a decent language should include a means of using types larger than "char" for purposes of manipulating "raw memory", but it hasn't yet. – supercat Feb 17 '16 at 08:28
  • @supercat Yes, I know. It is a “fingers crossed” approach. – 5gon12eder Feb 17 '16 at 08:30