How to efficiently compare a block of memory to a single byte?

Question

I'm trying to see if a struct came back as all 0xFF for the size of the struct.

memcmp seems like the obvious starting point, BUT I'd have to allocate a second memory block, populate it with 0xFF's. It just seems like a waste.

Does a standard function exist for this? Or should I just punt and iterate via a for loop?

you can try to compare by words. whether this is more efficient is something you have to measure. — Karoly Horvath, Aug 06 '14 at 17:08
Depending on the size of the struct, statically allocating an instance filled with 0xFF (ie: let the compiler and loader do this once off), and using `memcmp` should allow for the best method as the compiler would be able to specialize for a known/constant size. — Necrolis, Aug 06 '14 at 17:30
With @Karoly on this. Make sure to test the first byte(s) "manually" if they are not aligned on a DWORD, divide the remaining by 4 (or 8, if your system can handle 64 bit words), and let it rip. Search for "optimized memory copy" for hints on this. — Jongware, Aug 06 '14 at 17:30
Methinks some pointer arithmetic and masks should do the trick. Look out for struct packing and use sizeof. — Sithregal, Aug 06 '14 at 18:25

score 5 · Accepted Answer · answered Aug 06 '14 at 17:17

5

The most obvious solution here seems to be to simply loop over the size of the struct and compare it byte-by-byte.

The approach of allocating a block of 0xFF followed by memcmp should achieve the same with but a higher space complexity.

answered Aug 06 '14 at 17:17

Roney Michael

3,964
5
30
45

Basile Starynkevitch · Answer 2 · 2014-08-06T17:33:36.543

4

I know no standard function for this.

I don't think that memcmp is always the right choice (it needs twice the memory bandwith).

I would write an iteration (even a very naive one). Most compilers optimize that very well (when asked). So they probably unroll your loops and may do word compares (even if you coded a naive byte iteration).

You might code specialized openmp variants (at least on GCC). See http://openmp.org/

If the structure is big (e.g. dozens of kilobytes, because of the cost of GPGPU <-> RAM data copy) and if you have lot of development time to waste, consider perhaps OpenCL (in particular if you have specialized hardware supporting it, e.g. GPGPUs). It might never worth the cost (unless you do something -which does not require a lot of memory bandwidth- on the CPU while the GPGPU is working)

I would code the naive loop, and won't bother optimizing by hand (unless benchmarking of compiler-optimized code suggests otherwise), because the bottleneck is probably the memory bandwidth.

edited Aug 06 '14 at 17:33

answered Aug 06 '14 at 17:09

Basile Starynkevitch

223,805
18
296
547

1

On modern CPUs, well, on anything found on the desktop in the last 15 years, really, the block comparison against a constant can be coded to run at the memory bandwidth. Having a GPGPU do it is counterproductive, since by the time the system memory has been read, a single core would have finished the comparison. At that point a GPGPU is merely ready to start the work. Never mind the latencies of actually getting the GPGPU to start the task, and then getting the result back. – Kuba hasn't forgotten Monica Aug 06 '14 at 17:21
Agreed, this is why I mentioned GPGPUs only for large block of data. – Basile Starynkevitch Aug 06 '14 at 17:22
4

I claim that the use of GPGPU for *any* block of data is counterproductive by definition, unless all of your CPU cores are busy. If you're looking at latency of things, with an available single core, running the block comparison on that single core of the CPU will be done at memory bandwidth. IOW, it can't be done any faster in terms of wall time by splitting the work onto multiple cores, nor can it be done any faster by using the GPGPU. You simply can't get the data out of the memory any faster than that single core can swallow it. – Kuba hasn't forgotten Monica Aug 06 '14 at 17:23
Don't GPU:s often have higher memory bandwidth than CPU cores? – Tor Klingberg Aug 12 '14 at 16:37

score 3 · Answer 3 · answered Aug 06 '14 at 17:35

The logical name for such a function would be memcchr - it is to memchr as strcspn is to strspn.

And look here: google results for memcchr show that it has been implemented under that name as part of the FreeBSD kernel, and they've made some attempt at optimizing it beyond the obvious 1-byte-at-a-time loop.

It will probably take some additional work to make this function suitable for use in any program other than the FreeBSD kernel.

escrafford · Answer 4 · 2014-08-06T17:22:34.390

2

There is memchr(), which does the opposite of what you're asking for - search for the first occurrence of a byte within a mem block. afaik, there isn't a standard function to search for a byte that doesn't match a specific one. for loop sounds like the way to go. Maybe go 32/64bits at a time to speed it up.

-- An extra bit of not-answer: memcmp is going to be slower than a for loop. First, you'll need to fill a block of memory the same size as your original block (this portion will probably take as long as a naive for loop). Then you need to read each memory block into registers to compare them. A for loop will have a value in a register and just read in one memory block to compare with the unchanging register.

edited Aug 06 '14 at 17:22

answered Aug 06 '14 at 17:08

escrafford

2,373
16
19

To quickly fill a block of a known size with a single byte, use `memset`. (I *do* agree it must be slower than a naive loop, which, being intrinsically simple, surely will be optimized as much as possible.) – Jongware Aug 06 '14 at 17:27
As Kuba has hinted at in his comments, modern cpus are faster than ram (ram will be the bottleneck). Accessing ram 3x the size of your buffer will be ~3x slower than accessing it 1x, regardless of how well you optimize your for loop. – escrafford Aug 06 '14 at 17:33

Erik · Answer 5 · 2014-08-06T17:46:32.770

I don't know if this would help with performance a lot, but you could follow this algorithm:

Compare the 1st byte of the struct with 1 byte of allocated memory 0xFF
Compare the 2nd byte of the struct with the 1st byte of the struct
Compare bytes 3-4 of the struct with bytes 1-2 of the struct
Compare bytes 5-8 of the struct with bytes 1-4 of the struct

And continue in the same fashion until the end of the struct. If at any point the statement is false, you know the struct isn't all 0xFF. You would also need to handle it differently when the remaining portion of the struct is smaller than the first part checked, but that should be relatively simple.

In the end you've allocated 1 extra byte of memory and the algorithm is O(log n) (a slight improvement on what I've seen in the answers so far).

edit: As escrafford mentioned below, if you substitute "byte" for "word" in the above portion, it may run a little faster. I can't comment on how much speed you might gain, but it would increase the extra memory stored (albeit by a tiny amount on today's computers).

I think this is the only solution posted here that has a chance of being faster than a naive for loop. My only addition would be to use register sized comparisons (32/64bit). — escrafford, Aug 06 '14 at 17:37

score 0 · Answer 6 · edited May 23 '17 at 12:23

A dirty re-write of the code in Why does this implementation of strlen() work?. Did some quick tests; no guarantees.

This should return the number of 0xFF bytes; if it equals the number you started it with, you are in the safe. (Of course you could just let it return 0 or 1 as well.) Remove the printfs when satisfied.

#define LONGPTR_MASK (sizeof(long) - 1)

int find_no_ff (const char *memory, size_t length)
{
    const char *p;
    const unsigned long *lp;
    size_t remain = length, to_do;

    printf ("non-aligned, start:\n");
    /* Test the first few bytes until we have an aligned p */
    for (p = memory; (uintptr_t)p & LONGPTR_MASK; p++)
    {
        printf ("testing %02X\n", *p & 0xff);
        if (*p != '\xFF')
            return (p - memory);
        remain--;
    }

    printf ("passed.\n");

    printf ("aligned:\n");
    to_do = remain/sizeof(long);
    remain -= (to_do*sizeof(long));

    /* Scan the rest of the string using word sized operation */
    for (lp = (const unsigned long *)p; to_do--; lp++)
    {
        printf ("testing %08lX\n", *lp);
        if (*lp +1)
            return p - memory;
    }
    printf ("passed.\n");

    p = (const char *)lp;

    printf ("non-aligned, end:\n");
    /* Test the last bytes until we have an aligned p */
    while (remain--)
    {
        printf ("testing %02X\n", *p & 0xff);
        if (*p != '\xFF')
            return (p - memory);
        p++;
    }
    printf ("passed.\n");
    return p - memory;
}

int main (void)
{
    char data[] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

    printf ("input size: %ld\n", sizeof(data));
    printf ("test result: %d\n", find_no_ff (data, sizeof(data)));

    return 0;
}

score 0 · Answer 7 · answered May 05 '18 at 22:55

I like Erik's suggestion, but it can be simplified in an interesting way as follows (not tested):

if((*pBytes == 0xFF) && (memcmp(pBytes, pBytes + 1, byteCount - 1) == 0)) // The byteCount bytes at pBytes are 0xFFs.

The condition will be true only if A) the first byte is 0xFF, and B) every other byte is equal to the byte before it. The combination means every byte is 0xFF.

score 0 · Answer 8 · answered Nov 08 '22 at 14:32

0

Take a look into strspn function. But you need to place '\0' in the first byte after the struct under test in order to be able to use this function.

answered Nov 08 '22 at 14:32

user38759

96
1
3

How to efficiently compare a block of memory to a single byte?

8 Answers8