22

I am searching for a faster method of accomplishing this:

int is_empty(char * buf, int size) 
{
    int i;
    for(i = 0; i < size; i++) {
        if(buf[i] != 0) return 0;
    }
    return 1;
}

I realize I'm searching for a micro optimization unnecessary except in extreme cases, but I know a faster method exists, and I'm curious what it is.

Rob
  • 5,223
  • 5
  • 41
  • 62
  • 31
    Joke answer: `int is_empty(char *buf, int size) { memset(buf, 0, size); return 1; }` – Chris Lutz Sep 29 '09 at 17:31
  • How is that buffer used? – Cat Plus Plus Sep 29 '09 at 17:31
  • 10
    As a side note, you should really use `size_t` instead of `int` for values that represent the size of arrays. – Chris Lutz Sep 29 '09 at 17:32
  • 12
    Sounds like a job for Duff's Device! – Michael Burr Sep 29 '09 at 17:35
  • Re: PiotrLegnica -- it's used for a small assignment in implementing a Unix cp command through system calls. However, the implementation cp needs to preserve 'holes' in a file without actually writing out zeroes to the disk. The code checks if the buffer is empty, and if so lseek(). Else write(). – Rob Sep 29 '09 at 17:36
  • 2
    @Rob - you'll most likely be disk bound and shouldn't focus on this. A 10 ms read or write to disk is millions and millions of clock cycles on a modern CPU. Optimizing how you read off of the disk will get you much better results. – Michael Sep 29 '09 at 18:20
  • 2
    @Rob: http://kernelnewbies.org/Linux_2_6_28 look 1.13 FIEMAP – sambowry Sep 29 '09 at 19:47
  • [The best answer on a recent duplicate of this](http://stackoverflow.com/a/35450498/224132) is worth looking at. It compiles to not-bad code, although there is a lot of bloated auto-vectorized code for the cleanup loops that never actually runs. See [this gcc bug report](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69908) for gcc sucking at zero-checking functions in general, and that one in particular. Clang sucks at it too, unfortunately. – Peter Cordes Feb 22 '16 at 22:52

19 Answers

31

On many architectures, comparing 1 byte takes the same amount of time as comparing 4 or 8, or sometimes even 16. 4 bytes is normally easy (either int or long), and 8 is too (long or long long). 16 or higher probably requires inline assembly to, e.g., use a vector unit.

Also, since branch mispredictions really hurt, it may help to eliminate branches. For example, if the buffer is almost always empty, instead of testing each block against 0, bit-OR the blocks together and test the final result.


Expressing this is difficult in portable C: casting a char* to long* violates strict aliasing. But fortunately you can use memcpy to portably express an unaligned multi-byte load that can alias anything. Compilers will optimize it to the asm you want.

For example, this work-in-progress implementation (https://godbolt.org/z/3hXQe7) on the Godbolt compiler explorer shows that you can get a good inner loop (with some startup overhead) from loading two consecutive uint_fast32_t vars (often 64-bit) with memcpy and then checking tmp1 | tmp2, because many CPUs will set flags according to an OR result, so this lets you check two words for the price of one.

Getting it to compile efficiently for targets without efficient unaligned loads requires some manual alignment in the startup code, and even then gcc may not inline the memcpy for loads where it can't prove alignment.
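For concreteness, here is a minimal sketch of that approach (not the exact code from the Godbolt link; it omits the alignment startup just described, and the function name is made up):

#include <stdint.h>
#include <string.h>

int is_empty_words(const char *buf, size_t size)
{
    uint_fast32_t a, b;
    size_t i = 0;
    /* main loop: two aliasing-safe (memcpy) loads per iteration, one test */
    for (; i + 2 * sizeof(a) <= size; i += 2 * sizeof(a)) {
        memcpy(&a, buf + i, sizeof(a));            /* compiles to a plain load */
        memcpy(&b, buf + i + sizeof(a), sizeof(b));
        if (a | b)                                 /* one branch checks both words */
            return 0;
    }
    for (; i < size; i++)                          /* byte-at-a-time cleanup for the tail */
        if (buf[i])
            return 0;
    return 1;
}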

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
derobert
  • 49,731
  • 15
  • 94
  • 124
  • 7
    +1 for branch information, since the poster is looking for micro-optimization. – Ben M Sep 29 '09 at 17:49
  • 2
    Let it be noted that comparing groups of 1 bytes as a single integral value brings up issues of alignment, which makes it difficult to do safely. – Chris Lutz Sep 29 '09 at 18:03
There is no way you could really optimize this loop in regards to branch prediction. A modern processor will favor taking a backwards jump over falling through if it has no prediction information, and after that it will continue to correctly predict the branch until the loop ends. The end result is one misprediction per buffer; I can't imagine a way to get this down to 0. – Falaina Sep 29 '09 at 18:50
  • 4
    He's not talking about prediction of the loop, he's talking about prediction of testing that the buffer contents are zero (and improving it by doing *fewer* comparisons) – Stephen Canon Sep 29 '09 at 20:38
  • 3
    Alignment doesn't complicate things too much. Process bytes one at a time until you have the necessary alignment, then fall into a main loop that uses larger comparisons, and have some single-byte comparisons to clean up the leftover items when you're done. Alignment is only a substantial pain when you're dealing with multiple buffers that may have different relative alignments. – Stephen Canon Sep 29 '09 at 20:40
  • @StephenCanon exactly, that even has a name in simd programming. but this is nowhere near "uncomplicated" like you say. this causes the use of branching everywhere in the loop which hurts the performance in case of small buffers. – v.oddou Dec 10 '15 at 09:31
  • @v.oddou: right. The efficient way is to test the first 8 or 16 bytes of the buffer (regardless of alignment), then start from the first *aligned* 16-byte block. It doesn't matter if you check some bytes twice because of overlap, that's still much more efficient than a scalar byte loop whose trip-count depends on alignment. zero-checking is idempotent, so you only need one branch, to check if `size`>=16. – Peter Cordes Sep 26 '18 at 04:56
14

One potential way, inspired by Kieveli's dismissed idea:

int is_empty(char *buf, size_t size)
{
    static const char zero[999] = { 0 };
    return !memcmp(zero, buf, size > 999 ? 999 : size);
}

Note that you can't make this solution work for arbitrary sizes. You could do this:

int is_empty(char *buf, size_t size)
{
    char *zero = calloc(size, 1);   /* calloc takes (count, size) */
    int i = memcmp(zero, buf, size);
    free(zero);
    return !i;
}

But any dynamic memory allocation is going to be slower than what you have. The only reason the first solution is faster is because it can use memcmp(), which is going to be hand-optimized in assembly language by the library writers and will be much faster than anything you could code in C.

EDIT: An optimization no one else has mentioned, based on earlier observations about the "likelihood" of the buffer being in state X: if a buffer isn't empty, is it more likely to be non-empty at the beginning or at the end? If it's more likely to have cruft at the end, you could start your check at the end and probably see a nice little performance boost.

EDIT 2: Thanks to Accipitridae in the comments:

int is_empty(char *buf, size_t size)
{
    return buf[0] == 0 && !memcmp(buf, buf + 1, size - 1);
}

This basically compares the buffer to itself, with an initial check to see if the first element is zero. That way, any non-zero elements will cause memcmp() to fail. I don't know how this would compare to the other versions, but I do know that it will fail quickly (before we even loop) if the first element is nonzero. If you're more likely to have cruft at the end, check buf[size - 1] instead of buf[0] to get the same effect.
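For example, the end-first variant would look like this (untested; assumes size >= 1):

int is_empty(char *buf, size_t size)
{
    return buf[size - 1] == 0 && !memcmp(buf, buf + 1, size - 1);
}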

Community
  • 1
  • 1
Chris Lutz
  • 73,191
  • 16
  • 130
  • 183
  • 4
    On the other hand, the memcmp version is going to be bound by the memory bandwidth, and so will the smarter C version if the compiler is not too stupid, but the C version will have *half* the memory to read. Without trying, I predict the C version to be at least 50% faster than the memcmp one. – Pascal Cuoq Sep 29 '09 at 17:50
  • a) What "C" versions are you comparing? b) Why would `memcmp()` take longer? – Chris Lutz Sep 29 '09 at 17:53
  • 18
    Instead of allocating memory you could use: buf[0]==0 && !memcmp(buf, buf+1, size-1). – Accipitridae Sep 29 '09 at 18:36
  • 1
    @Pascal - Timing it, the first `memcmp()` version is almost the same speed as the OP's version, clocking it consistently _slightly_ (like, 0.001s) faster. – Chris Lutz Sep 29 '09 at 18:42
  • You might want to make that `static` array `const`, too. That version is trivially inline-able, which could be another (tiny) win. – caf Sep 29 '09 at 22:30
  • For the shifted memcmp version, you could cast to int* and use sizeof(int) instead of 1. worth profiling, maybe the leading answer says more. – fuzzyTew May 10 '21 at 08:21
13

The benchmarks given in another answer (https://stackoverflow.com/a/1494499/2154139) are not accurate. They imply that func3 is much faster than the other options.

However, if you change the order of the tests so that func3 comes before func2, you'll see that func2 is much faster.

Be careful when running combined benchmarks within a single execution: the side effects (cache warm-up in particular) are large, especially when reusing the same variables. Better to run the tests in isolation!

For example, changing it to:

int main(){
  MEASURE( func3 );
  MEASURE( func3 );
  MEASURE( func3 );
  MEASURE( func3 );
  MEASURE( func3 );
}

gives me:

func3: zero          14243
func3: zero           1142
func3: zero            885
func3: zero            848
func3: zero            870

This was really bugging me as I couldn't see how func3 could perform so much faster than func2.

(Apologies for posting this as an answer rather than a comment; I didn't have the reputation.)

InfinitoTaco
  • 131
  • 1
  • 2
12

Four functions for testing zeroness of a buffer with simple benchmarking:

#include <stdio.h> 
#include <string.h> 
#include <wchar.h> 
#include <inttypes.h> 

#define SIZE (8*1024) 
char zero[SIZE] __attribute__(( aligned(8) ));

#define RDTSC(var)  __asm__ __volatile__ ( "rdtsc" : "=A" (var)); 

#define MEASURE( func ) { \ 
  uint64_t start, stop; \ 
  RDTSC( start ); \ 
  int ret = func( zero, SIZE ); \ 
  RDTSC( stop ); \ 
  printf( #func ": %s   %12"PRIu64"\n", ret?"non zero": "zero", stop-start ); \ 
} 


int func1( char *buff, size_t size ){
  while(size--) if(*buff++) return 1;
  return 0;
}

int func2( char *buff, size_t size ){
  return *buff || memcmp(buff, buff+1, size-1);
}

int func3( char *buff, size_t size ){
  return *(uint64_t*)buff || memcmp(buff, buff+sizeof(uint64_t), size-sizeof(uint64_t));
}

int func4( char *buff, size_t size ){
  return *(wchar_t*)buff || wmemcmp((wchar_t*)buff, (wchar_t*)buff+1, size/sizeof(wchar_t)-1);
}

int main(){
  MEASURE( func1 );
  MEASURE( func2 );
  MEASURE( func3 );
  MEASURE( func4 );
}

Result on my old PC:

func1: zero         108668
func2: zero          38680
func3: zero           8504
func4: zero          24768
sambowry
  • 2,436
  • 1
  • 16
  • 13
  • 4
    intriguing, but func3 and func4 need some minor fixup to address non-integer multiples of wordsize, and alignment issues – Jason S Sep 30 '09 at 22:09
  • 2
    See @dissasorted's answer below for an observation of this benchmarking: http://stackoverflow.com/a/29382742/1185900 – Andrew Smart May 01 '16 at 05:08
  • This can fault on 1-byte buffer right before an unmapped page. But yes, interesting idea to leverage asm-optimized library functions. Checking at 16 or 32 byte offset might be better for large inputs so the library function doesn't need to do misaligned SIMD vectors. **However, `*(uint64_t*)buff` violates strict aliasing** - you need to load it with `memcpy` or declare a GNU C `may_alias` version of the type: [Why does glibc's strlen need to be so complicated to run quickly?](//stackoverflow.com/a/57676035) – Peter Cordes Nov 19 '19 at 18:15
  • See [How to get the CPU cycle count in x86\_64 from C++?](//stackoverflow.com/q/13772567) though. `"=A"` doesn't do what you want for rdtsc in 64-bit code. Also beware of CPU-frequency warm-up effects: slow at first then fast. – Peter Cordes Nov 19 '19 at 18:17
10

If your program is x86-only or x64-only, you can easily optimize using inline assembler. The REPE SCASD instruction will scan a buffer until a dword that does not equal EAX is found.

Since there is no equivalent standard library function, probably no compiler/optimizer will be able to use these instructions (as confirmed by Sufian's compiler output).

Off the top of my head, something like this would do if your buffer length is 4-byte aligned (MASM syntax):

_asm {
   CLD                ; search forward
   XOR EAX, EAX       ; search for non-zero
   LEA EDI, [buf]     ; search in buf
   MOV ECX, [buflen]  ; search buflen bytes
   SHR ECX, 2         ; using dwords so len/=4
   REPE SCASD         ; scan while each dword equals EAX
   JZ bufferEmpty     ; ZF still set? then buffer is all 0
}

Tomas

EDIT: updated with Tony D's fixes
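The comments below ask how to embed this from GCC/Clang. A rough, untested sketch of the same idea in GNU extended asm (assuming x86/x86-64, a non-zero buffer length that is a multiple of 4, and an ABI that guarantees the direction flag is already clear; the function name is made up):

#include <stddef.h>

static int is_empty_scasd(const void *buf, size_t size)
{
    const void *p = buf;
    size_t count = size / 4;          /* number of dwords to scan */
    unsigned char all_zero;
    __asm__ __volatile__ (
        "repe scasl\n\t"              /* AT&T mnemonic for SCASD: compare EAX with [(E/R)DI], step by 4 */
        "setz %0"                     /* ZF still set => every dword compared equal to zero */
        : "=q" (all_zero), "+D" (p), "+c" (count)
        : "a" (0)
        : "cc", "memory"
    );
    return all_zero;
}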

Tomas
  • 5,067
  • 1
  • 35
  • 39
  • 1
    Best answer - upvoted. Could you show me how to put this into a GCC/Clang C program as embedded ASM? (Does this require GAS syntax?) Is there a 64 bit version using unsigned long longs? – fadedbee Feb 22 '13 at 10:37
  • 1
    +1 solid approach - nit picks: should be `LEA EDI, ...` not ESI, and missing CLD to guarantee the direction flag is clear. – Tony Delroy Apr 26 '13 at 10:22
  • 2
Careful: inline assembler is not statically analyzable by compilers and causes a full ordering memory fence before and after the block! This can be horrible in some cases. – v.oddou Dec 10 '15 at 09:35
  • 2
    Unfortunately, `repe scasd` [isn't faster than a normal loop](http://agner.org/optimize). String store and copy instructions (`rep stos` and `rep movs`) have optimized microcode implementations in Intel CPUs, but `repe scas` and `repe cmps` don't. Also @TonyD: All the normal ABIs require the direction flag to be cleared on function entry. `cld` is a waste of an instruction. – Peter Cordes Jan 28 '16 at 06:54
8

For something so simple, you'll need to see what code the compiler is generating.

$ gcc -S -O3 -o empty.s empty.c

And the contents of the assembly:

        .text
        .align 4,0x90
.globl _is_empty
_is_empty:
        pushl       %ebp
        movl        %esp, %ebp
        movl        12(%ebp), %edx  ; edx = size
        movl        8(%ebp), %ecx   ; ecx = pointer to buffer
        testl       %edx, %edx
        jle L3
        xorl        %eax, %eax
        cmpb        $0, (%ecx)
        jne L5
        .align 4,0x90
L6:
        incl        %eax            ; real guts of the loop are in here
        cmpl        %eax, %edx
        je  L3
        cmpb        $0, (%ecx,%eax) ; compare byte-by-byte of buffer
        je  L6
L5:
        leave
        xorl        %eax, %eax
        ret
        .align 4,0x90
L3:
        leave
        movl        $1, %eax
        ret
        .subsections_via_symbols

This is very optimized. The loop does three things:

  • Increase the offset
  • Compare the offset to the size
  • Compare the byte-data in memory at base+offset to 0

It could be optimized slightly more by comparing on a word-by-word basis, but then you'd need to worry about alignment and such.

When all else fails, measure first, don't guess.

Sufian
  • 8,627
  • 4
  • 22
  • 24
  • 2
    It could be optimized MUCH more by using string instruction REP SCASx – Tomas May 17 '10 at 09:23
  • @Tomas: only by virtue of checking multiple bytes at a time with `repe scasd`. Also, this asm kinda sucks. It uses a 2-register addressing mode, but still does a compare-and-branch instead of using the flags set by `inc`. Better would be to get a pointer to the end of the array and count a negative index up towards zero. Also, Sufian: doing a word or dword at a time would be a huge optimization, not just slight. Factor of two (word) or four (dword) for large buffers. Also, modern Intel CPUs [apparently can't micro-fuse 2-reg addressing modes](http://stackoverflow.com/q/26046634/224132) – Peter Cordes Jan 28 '16 at 07:03
6

Try checking the buffer using an int-sized variable where possible (it should be aligned).

Off the top of my head (uncompiled, untested code follows - there's almost certainly at least one bug here. This just gives the general idea):

/* check the start of the buf byte by byte while it's unaligned */
while (size && !int_aligned( buf)) {
    if (*buf != 0) {
        return 0;
    }

    ++buf;
    --size;
}


/* check the bulk of the buf int by int while it's aligned */

size_t n_ints = size / sizeof( int);
size_t rem = size % sizeof( int);

int* pInts = (int*) buf;

while (n_ints) {
    if (*pInts != 0) {
        return 0;
    }

    ++pInts;
    --n_ints;
}


/* now wrap up the remaining unaligned part of the buf byte by byte */

buf = (char*) pInts;

while (rem) {
    if (*buf != 0) {
        return 0;
    }

    ++buf;
    --rem;
}

return 1;
Michael Burr
  • 333,147
  • 50
  • 533
  • 760
5

With x86 you can use SSE to test 16 bytes at a time:

#include "smmintrin.h" // note: requires SSE 4.1

int is_empty(const char *buf, const size_t size) 
{
    size_t i;
    for (i = 0; i + 16 <= size; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)&buf[i]);
        if (!_mm_testz_si128(v, v))
            return 0;
    }
    for ( ; i < size; ++i)
    {
        if (buf[i] != 0)
            return 0;
    }
    return 1;
}

This can probably be further improved with loop unrolling.

On modern x86 CPUs with AVX you can even use 256 bit SIMD and test 32 bytes at a time.
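A rough sketch of that 256-bit variant (untested; assumes AVX is available and enabled at compile time, e.g. with -mavx):

#include <immintrin.h> // AVX

int is_empty_avx(const char *buf, const size_t size)
{
    size_t i;
    for (i = 0; i + 32 <= size; i += 32)
    {
        __m256i v = _mm256_loadu_si256((const __m256i *)&buf[i]);
        if (!_mm256_testz_si256(v, v))   // vptest: ZF set iff all 256 bits are zero
            return 0;
    }
    for ( ; i < size; ++i)               // scalar cleanup for the tail
    {
        if (buf[i] != 0)
            return 0;
    }
    return 1;
}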

Paul R
  • 208,748
  • 37
  • 389
  • 560
  • 1
    You'd probably see a small improvement from testing 64B (one cache line) at a time, by ORing them together and using `ptest` on the result. (`ptest` is 2 uops, and doesn't macro-fuse with branch). Using multiple `por` instructions each iteration should help saturate both load ports, since there's loop overhead. Also note that if size >= 16, the cleanup loop can be a single `ptest` of an unaligned load that ends at the last byte. It's ok that this load overlaps with the last main loop iteration, because testing-for-zero is idempotent. – Peter Cordes Feb 18 '16 at 11:39
  • @PaulR: My mistake, unaligned loads are OK. You might get better performance with an unaligned load for the first batch and then align the pointer and use aligned loads for the rest of the array. – chqrlie Jan 21 '19 at 13:36
  • @chqrlie: what you say is true, but on modern CPUs there is very little throughout impact when using unaligned loads. It’s still worth ensuring alignment where you can, but the penalty for unaligned loads is far less significant than it was on older CPUs. – Paul R Jan 21 '19 at 17:23
2

The Hacker's Delight book/site is all about optimized C/assembly. The site also has lots of good references and is fairly up to date (AMD64, NUMA techniques too).

RandomNickName42
  • 5,923
  • 1
  • 36
  • 35
2

Look at fast memcpy - it can be adapted for memcmp (or memcmp against a constant value).

Vitali
  • 3,411
  • 2
  • 24
  • 25
  • 2
    There is far too little meat here for an answer... this should have been only a comment, or flesh out how to adapt. – Nicu Stiurca Jan 10 '16 at 23:40
2

I see a lot of people saying things about alignment issues preventing you from doing word-sized accesses, but that's not always true. If you're looking to write portable code, this is certainly an issue; however, x86 will actually tolerate misaligned accesses. For example, this will only fail on x86 if alignment checking is turned on in EFLAGS (and, of course, buf is actually not word-aligned).

int is_empty(char * buf, int size) {
 int i;
 for(i = 0; i + 4 <= size; i += 4) {
   if(*(int *)(buf + i) != 0) {
     return 0;
   }   
 }

 for(; i < size; i++) {
   if(buf[i] != 0) 
     return 0;
 }

 return 1;
}

Regardless, the compiler CAN convert your original loop into a loop of word-based comparisons with extra jumps to handle alignment issues; however, it will not do this at any normal optimization level because it lacks information. For cases where size is small, unrolling the loop in this way would make the code slower, and the compiler wants to be conservative.

A way to get around this is to make use of profile-guided optimization. If you let GCC collect profile information on the is_empty function and then re-compile it, it will be willing to unroll the loop into word-sized comparisons with an alignment check. You can also force this behavior with -funroll-all-loops.
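For reference, the usual GCC profile-guided optimization workflow looks roughly like this (file names are placeholders, and exact flag behavior varies between GCC versions):

$ gcc -O3 -fprofile-generate -o is_empty_test is_empty_test.c
$ ./is_empty_test                  # run a representative workload; writes .gcda profile data
$ gcc -O3 -fprofile-use -o is_empty_test is_empty_test.c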

Falaina
  • 6,625
  • 29
  • 31
2

Did anyone mention unrolling the loop? In any of these loops, the loop overhead and indexing are going to be significant (a rough sketch follows below).

Also, what is the probability that the buffer will actually be empty? That's the only case where you have to check all of it. If there typically is some garbage in the buffer, the loop should stop very early, so it doesn't matter.

If you plan to clear it to zero if it's not zero, it would probably be faster just to clear it with memset(buf, 0, size), whether or not it's already zero.
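For the unrolling idea mentioned above, a rough, untested sketch that checks four bytes per branch by OR-ing them together:

int is_empty_unrolled(const char *buf, int size)
{
    int i = 0;
    for (; i + 4 <= size; i += 4) {
        /* one branch per four bytes instead of four branches */
        if (buf[i] | buf[i + 1] | buf[i + 2] | buf[i + 3])
            return 0;
    }
    for (; i < size; i++)            /* handle the remaining 0-3 bytes */
        if (buf[i] != 0)
            return 0;
    return 1;
}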

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
2

What about looping from size to zero (cheaper checks):

int is_empty(char * buf, int size) 
{
    while(size --> 0) {
        if(buf[size] != 0) return 0;
    }
    return 1;
}

It must be noted that we probably cannot outperform the compiler, so enable the most aggressive speed optimization in your compiler and assume that you're unlikely to go any faster.

Or handle everything using pointers (not tested, but likely to perform quite well):

int is_empty(char* buf, int size)
{
    char* org = buf;

    if (buf[size-1] == 1)
        return 0;

    buf[size-1] = 1;
    while(! *buf++);
    buf--;

    return buf == org[size-1];
}
malat
  • 12,152
  • 13
  • 89
  • 158
Aloys
  • 682
  • 4
  • 11
  • I agree with your first remark (looping down to zero). And the other trivial optimization would be to remove the use of indexing, i.e. replace the if-statement with `if (*buf++) return 0;`. – wovano Feb 29 '20 at 18:46
  • Your second approach is interesting! Of course it would only be valid if it's allowed to write to the buffer (which would generally not be the case). And there are some bugs in the code: out-of-bounds access when `size == 0`, the original contents of `buf[size-1]` is not restored, and the last statement misses a `&`. I'd also avoid the unnecessary address calculations for `buf[size-1]` (three times). But I like the thinking out of the box and there could be scenarios where this solution is relatively simple and efficient :) – wovano Feb 29 '20 at 18:47
1

You stated in your question that you are looking for a most likely unnecessary micro-optimization. In 'normal' cases the ASM approach by Tomas and others should give you the fastest results.

Still, this is forgetting the big picture. If your buffer is really large, then starting from the start and essentially doing a linear search is definitely not the fastest way to do this. Assume the buffers your cp replacement handles consist mostly of large consecutive empty regions but have a few non-empty bytes at the end of the array. All linear searches would require reading the whole array. On the other hand, a quicksort-inspired algorithm could probe for non-zero elements and abort much faster for a large enough dataset.

So before doing any kind of micro-optimization, I would look closely at the data in your buffers and see whether that gives you any patterns. For a single '1' randomly distributed in the buffer, a linear search (disregarding threading/parallelization) will be the fastest approach; in other cases, not necessarily so.

Alexander
  • 2,174
  • 3
  • 16
  • 25
1

Inline assembly version of the initial C code. There is no error checking: if uiSize == 0 and/or the array is not allocated, exceptions will be generated. Perhaps use try {} catch(), as this might be faster than adding a lot of checks to the code. Or do as I do: try not to call functions with invalid values (usually does not work). At least add a NULL pointer check and a size != 0 check; that is very easy.

 unsigned int IsEmpty(char* pchBuffer, unsigned int uiSize)
 {
    asm {
      push edi
      push ecx         

      mov edi, [pchBuffer]
      mov ecx, [uiSize]

      // add NULL ptr and size check here

      mov eax, 0

    next_char:
      repe scasb           // repeat string instruction as long as BYTE ptr es:[EDI] == AL (0)
                           // scasb compares AL with the byte at ES:[EDI] and increments EDI by 1
      cmp cx,0             // did the loop complete?
      je all_chars_zero    // yes, array is all 0
      jmp char_not_zero    // no, loop was interrupted due to BYTE PTR ds:[ESI] != 0

    all_chars_zero:        
      mov eax, 1           // Set return value (works in MASM)
      jmp end  

    char_not_zero:
      mov eax, 0          // Still not sure if this works in inline asm

    end:
      pop ecx
      pop edi
  }
}

That is written on the fly, but it looks correct enough; corrections are welcome. And if someone knows how to set the return value from inline asm, please do tell.

james jelo4kul
  • 839
  • 4
  • 17
phazer
  • 89
  • 7
  • 2
    `repe scasb` is not fast on any modern x86 CPU. https://agner.org/optimize/. Also, `repe scasb` leaves flags set (unless ECX was zero, in which case it does zero `scasb`). All you need is a `setz al` with no branching. Or if you do look at ECX, look at the whole ECX, not just CX. You also don't need to push/pop regs yourself; the compiler will save/restore them around the whole function if needed. (Not needed for ECX; it's call-clobbered.) – Peter Cordes Sep 26 '18 at 04:42
  • @PeterCordes, any comparisons on rep prefix vs other approach? – Sergei Krivonos Nov 19 '19 at 16:58
  • Using AVX `vmovaps` + `vorps` + `vptest` + `jnz` to check a 64-byte cache line at about 64 bytes per clock cycle on Haswell / Skylake, or Zen 2. (Might need to unroll more for 128-byte chunks to avoid a front-end bottleneck actually). Using `rep scasq`, you can check 8 bytes / clock. `rep scasb` can check 1 byte per clock on modern Intel like Skylake. – Peter Cordes Nov 19 '19 at 18:07
-2
int is_empty(char * buf, int size) 
{
    int i, content=0;  
    for(i = 0; !content && i < size; i++)    
    {  
        content = content | buf[i];     // bitwise or
    }  
    return (content==0);  
}
Christian Rau
  • 45,360
  • 10
  • 108
  • 185
Emilio M Bumachar
  • 2,532
  • 3
  • 26
  • 30
  • a) Please use formatting. b) `||` is not bitwise OR, it's logical OR. – Chris Lutz Sep 30 '09 at 03:18
  • c) For efficiency, change your `for()` loop to `for(i = 0; !content && i < size; i++)` so that it won't keep looping once bad content is found. – Chris Lutz Sep 30 '09 at 03:20
  • If you're going to check `content` every iteration, there's no point bitwise-ORing into it. bitwise-OR of a whole aligned block of data makes sense to do less branching, or let the compiler maybe optimize into asm that checks 4 or 8 bytes at a time, but this isn't useful. – Peter Cordes Sep 26 '18 at 04:47
-3
int is_empty(char * buf, int size)
{
   return buf[0] == '\0';
}

If your buffer is not a character string, I think that's the fastest way to check...

memcmp() would require you to create a buffer the same size and then use memset to set it all as 0. I doubt that would be faster...

Kieveli
  • 10,944
  • 6
  • 56
  • 81
  • 4
    `memcmp()` doesn't _require_ that. You could make a really big static buffer initialized with `{ 0 }` and compare only portions of that to the buffer in question. – Chris Lutz Sep 29 '09 at 17:35
  • 1
A really big buffer of all-zero would perform *much* better if you allocate it with `calloc` on a system where new pages from the OS start out copy-on-write mapped to the same physical zeroed page. That way the entire `calloc` buffer is hot in L1 cache. You'll still get TLB misses, though. The only reason `memcmp` is even worth considering is that current compilers really suck at compiling loops that exit after finding something. They don't auto-vectorize, and byte-at-a-time loops usually stay byte-at-a-time. You can get OK results by ORing some `size_t`s together, though. – Peter Cordes Feb 18 '16 at 11:46
-3

Edit: Bad answer

A novel approach might be

int is_empty(char * buf, int size) {
    char start = buf[0];
    char end = buf[size-1];
    buf[0] = 'x';
    buf[size-1] = '\0';
    int result = strlen(buf) == 0;
    buf[0] = start;
    buf[size-1] = end;
    return result;
}

Why the craziness? Because strlen is one of the library functions that's most likely to be optimized. Storing and replacing the first character is to prevent a false positive. Storing and replacing the last character is to make sure the string terminates.

Calyth
  • 1,673
  • 3
  • 16
  • 26
  • 1
    This won't work for the binary data `0 0 1 0` among others. It doesn't check whether or not a string is empty. – Chris Lutz Sep 29 '09 at 18:01
-4

The initial C algorithm is pretty much as slow as it can be in VALID C. If you insist on using C then try a "while" loop instead of "for":

     int i = 0;
     while (i< MAX)
     {
        // operate on the string
        i++;
     }

This is pretty much the fastest one-dimensional string-operation loop you can write in C, unless you can force the compiler to put i in a register with the "register" keyword, but I am told that this is almost always ignored by modern compilers.

Also, searching a constant-sized array to check whether it is empty is very wasteful, and 0 is not "empty"; it is a value in the array.

A better solution for speed would be to use a dynamic array (int* piBuffer) and a variable that stores the current size (unsigned int uiBufferSize); when the array is empty, the pointer is NULL and uiBufferSize is 0. Make a class with these two as protected member variables. One could also easily write a template for dynamic arrays, which would store 32-bit values, either primitive types or pointers. For primitive types there is not really any way to test for "empty" (I interpret this as "undefined"), but you can of course define 0 to represent an available entry. For an array of pointers you should initialize all entries to NULL, and set an entry to NULL when you have just deallocated that memory; NULL means "points at nothing", so this is a very convenient way to represent empty.

One should not use dynamically resized arrays in really complicated algorithms, at least not in the development phase; there are simply too many things that can go wrong. One should at least first implement the algorithm using an STL container (or a well-tested alternative) and then, when the code works, swap the tested container for a simple dynamic array (and if you can avoid resizing the array too often, the code will be both faster and more fail-safe).

A better solution for complicated and cool code is to use either std::vector or std::map (or any container class, STL, homegrown, or 3rd-party) depending on your needs, but looking at your code I would say that std::vector is enough. The STL containers are templates, so they should be pretty fast too. Use an STL container to store object pointers (always store object pointers and not the actual objects; copying entire objects for every entry will really mess up your execution speed) and dynamic arrays for more basic data (bitmaps, sound, etc.), i.e. primitive types, generally.

I came up with the REPE SCASW solution independently by studying x86 assembly language manuals, and I agree that the example using this string-operation instruction is the fastest. The other assembly example, which has separate compare, jump, etc. instructions, is almost certainly slower (but still much faster than the initial C code, so still a good post), as the string operations are among the most highly optimized on all modern CPUs; they may even have their own logic circuitry (does anyone know?).

The REPE SCASD does not need to fetch a new instruction nor increase the instruction pointer, and that is just the stuff an assembly novice like me can come up with; on top of that comes the hardware optimization. String operations are critical for almost all kinds of modern software, in particular multimedia applications (copying PCM sound data, uncompressed bitmap data, etc.), so optimizing these instructions must have been a very high priority every time a new 80x86 chip was being designed. I use it for a novel 2D sprite collision algorithm.

It says that I am not allowed to have an opinion, so consider the following an objective assessment: modern compilers (for UNMANAGED C/C++; pretty much everything else is managed code and is slow as hell) are pretty good at optimizing, but it cannot be avoided that for VERY specific tasks the compiler generates redundant code. One could look at the assembly that the compiler outputs so that one does not have to translate a complicated algorithm entirely from scratch, even though it is very fun to do (for some) and it is much more rewarding to do code the hard way. Anyway, algorithms using "for" loops, in particular with regard to string operations, can often be optimized very significantly, as the for loop generates a lot of code that is often not needed. Example: for (int i = 1000; i>0; i--) DoSomething(); This line generates 6-10 lines of assembly if the compiler is not very clever (it might be), but the optimized assembly version CAN be:

      mov cx, 1000
    _DoSomething:
      // loop code....or call Func, slower but more readable
      loop _DoSomething

That was 2 lines, and it does exactly the same as the C line (it uses registers instead of memory addresses, which is MUCH faster, though arguably it is not EXACTLY the same as the C line, but that is semantics). How much of an optimization this example is depends on how well modern compilers optimize, which I have no clue about, but analyzing an algorithm with the goal of implementing it in the fewest and fastest assembly lines often works well. I have had very good results from first implementing the algorithm in C/C++ without caring about optimization and then translating and optimizing it in assembly. The fact that each C line becomes many assembly lines often makes some optimizations very obvious, and also some instructions are faster than others:

      INC DX ; is faster than:
      ADD DX,1 ;if ADD DX,1 is not just replaced with INC DX by the assembler or the CPU
      LOOP ; is faster than manually decreasing, comparing and jumping
      REPxx STOSx/MOVSx/LODSx is faster than using cmp, je/jne/jea etc and loop
      JMP or conditional jumping is faster than using CALL, so in a loop that is executed VERY frequently (like rendering), including functions in the code so it is accessible with "local" jumps can also boost performance.

The last bit, fast string operations, is very relevant to this question. So this post is not all rambling.

And lastly, design your assembly algorithm in the way that requires the least amount of jumps for a typical execution.

Also, don't bother optimizing code that is not called that often; use a profiler and see what code is called most often, and start with that. Anything that is called less than 20 times a second (and completes much faster than 1000 ms / 20) is not really worth optimizing. Look at code that is not synchronized to timers and the like and is executed again immediately after it has completed. On the other hand, if your rendering loop can do 100+ FPS on a modest machine, it does not make sense economically to optimize it, but real coders love to code and do not care about economics; they optimize the AppStart() method into 100% assembly even though it is only called once :) Or use a z rotation matrix to rotate Tetris pieces 90 degrees :P Anyone who does that is awesome!

If anyone has some constructive correction, which is not VERY hurtful, then I would love to hear it. I code almost entirely by myself, so I am not really exposed to any influences. I once paid a nice Canadian game developer to teach me Direct3D, and though I could just as easily have read a book, the interaction with another coder who was somewhat above my level in certain areas was fun.

Thanks for good content generally. I think I will go and answer some of the simpler questions, give a little back.

phazer
  • 89
  • 7
  • 1
    Don't use 16bit operand-size in your asm. And never use the `loop` instruction unless you're optimizing for code size at the expense of speed. Also, `repe scasd` isn't faster than a normal loop. There's no reason to use `repe scasw`: if you're going to handle working in larger chunks (alignment, reading past the end, etc.), then use 4B chunks (or 8B in 64bit mode). Or 16B with SSE2 vectors instructions. See http://agner.org/optimize/ – Peter Cordes Jan 28 '16 at 07:12
  • 1
    Also, `for` loops don't compile different from `while` loops. And the `register` keyword does (almost?) nothing with gcc. It's already good at keeping as much as possible live in registers, minimizing spills to memory. Your comment that the OP's code "is as slow as it could be with valid C" is totally wrong. It looks fine to me. I didn't plug it in to http://gcc.godbolt.org/, but I'd expect a `while` version of it to compile to identical asm instructions. – Peter Cordes Jan 28 '16 at 07:13