75

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?

I assume that memset uses mov; however, when zeroing memory most compilers use xor, as it's faster. Correct? edit1: Wrong, as GregS pointed out: that only works with registers. What was I thinking?

Also, I asked a person who knows more about assembler than I do to look at the stdlib, and he told me that on x86, memset does not take full advantage of the 32-bit-wide registers. However, I was very tired at the time, so I'm not quite sure I understood him correctly.

edit2: I revisited this issue and did a little testing. Here is what I tested:

    #include <stdio.h>
    #include <stdlib.h>   /* malloc; malloc.h is non-standard */
    #include <string.h>
    #include <sys/time.h>

    #define TIME(body) do {                                                     \
        struct timeval t1, t2; double elapsed;                                  \
        gettimeofday(&t1, NULL);                                                \
        body                                                                    \
        gettimeofday(&t2, NULL);                                                \
        elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
        printf("%s\n --- %f ---\n", #body, elapsed); } while(0)                 \


    #define SIZE 0x1000000

    void zero_1(void* buff, size_t size)
    {
        size_t i;
        char* foo = buff;
        for (i = 0; i < size; i++)
            foo[i] = 0;

    }

    /* I foolishly assume size_t has register width */
    void zero_sizet(void* buff, size_t size)
    {
        size_t i;
        char* bar;
        size_t* foo = buff;
        for (i = 0; i < size / sizeof(size_t); i++)
            foo[i] = 0;

        // fixes bug pointed out by tristopia
        bar = (char*)buff + size - size % sizeof(size_t);
        for (i = 0; i < size % sizeof(size_t); i++)
            bar[i] = 0;
    }

    int main()
    {
        char* buffer = malloc(SIZE);
        TIME(
            memset(buffer, 0, SIZE);
        );
        TIME(
            zero_1(buffer, SIZE);
        );
        TIME(
            zero_sizet(buffer, SIZE);
        );
        return 0;
    }

results:

zero_1 is the slowest, except at -O3. zero_sizet is the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow at -O3). One thing of interest: at -O3, zero_1 was as fast as zero_sizet, yet the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). Also, I tried optimizing zero_sizet further, but the compiler always outdid me; no surprise there.

For now memset wins; the previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing is needed. I'll try assembler next :)

edit3: fixed a bug in the test code; the test results are not affected

edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zeroing. This will be hard to beat.

user5534993
maep
    Instead of assuming that `memset` uses `mov`, why don't you try disassembling the output of your compiler? Different compilers will do different things. If `xor` is faster on a given architecture, then it wouldn't be surprising if some compilers optimize `memset(ptr, 0, nbytes)` into `xor` instructions. – Laurence Gonsalves Sep 06 '10 at 23:57
  • I am not aware of a compiler that uses XOR to zero memory. Maybe a register, but not memory. In order to use XOR to zero memory, you first have to read memory, then XOR, then write memory. – President James K. Polk Sep 07 '10 at 00:00
  • If appropriate, `calloc` might be effectively free, because the implementation might zero pages ahead of time, while the CPU is otherwise idle. Does that count? ;-) – Steve Jessop Sep 07 '10 at 00:01
  • @Greg Intel compiler likes to initialize registers with XOR – Anycorn Sep 07 '10 at 00:30
  • @aaa carp, GregS is right that using an XOR to clear memory will almost never be the right answer. The exception that pops to mind is the smallest Microchip PIC CPUs, where all available RAM is really just a large field of general purpose registers. Clearing a register with an XOR is common in CISC architectures. Some RISC architectures effectively include a general purpose register that is hardwired to hold zero just for the purpose. – RBerteig Sep 07 '10 at 02:09
  • Your `zero_sizet` cheats by not doing all the work `memset` does. They're only equivalent if your copied sizes are multiples of `sizeof(size_t)`. In the other cases `memset` also has to zero the remaining 1 to 3 (or 7) bytes. Another problem may arise if your address is not aligned; a good `memset` implementation will try to align its `size_t *` before looping. This penalizes the good case a little bit, but wins a lot in the odd case. A good implementation will also choose 64-bit alignment, as the memory bus on PCs has been 64 bits wide since the Pentium 1. – Patrick Schlüter Jul 11 '11 at 13:19
  • @tristopia you are right, I missed that. As far as my experience goes, alignment doesn't seem to be a big issue with modern x86 CPUs. Like I said, I tried a few tricks but always got in the way of the compiler. – maep Jul 20 '11 at 20:57
  • I realize I am a little late to the party, but if you *really* care about zero-ing a ton of memory really fast *and* you are on POSIX *and* you know that your memory is 4k page-aligned (or are willing to make it so), `madvise` with `MADV_DONTNEED` will zero out huge sections of memory very quickly. Probably worth noting that this is highly platform-dependent, so check how your kernel reacts to this operation first :-) – Travis Gockel May 03 '13 at 20:43
  • @TravisGockel Also highly unwise: `madvise()` can just be a no-op. It might work on a *specific kernel and libc version*, but sounds like it could easily break if you upgrade either. – tc. May 14 '13 at 15:18
  • *"I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way..."* - the other question to ask, is ***`memset(ptr, 0, nbytes)`*** guaranteed to survive optimization. This use case is zeroizing sensitive material in memory. The answer to that is ***NO***. `memset_s` is guaranteed to survive optimizations, but the Glibc folks refuse to provide it. Also see [Issue 17879: Library is missing memset_s](https://sourceware.org/bugzilla/show_bug.cgi?id=17879). – jww Feb 27 '16 at 14:51
  • try reverse looping --i – Edgehog.net Sep 30 '16 at 13:28

9 Answers

40

x86 is a rather broad range of devices.

For a totally generic x86 target, an assembly block with "rep stosd" could blast out zeros to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD-aligned.

For chips with MMX, an assembly loop with movq could hit 64 bits at a time.

You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or __m64. The target must be 8-byte aligned for best performance.

For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use byte stores (stosb) until aligned, and then complete your clear with a loop of movaps.

Win32 has "ZeroMemory()", but I forget whether that's a macro for memset or an actual 'good' implementation.
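
For illustration, here is a minimal C sketch of that align-then-movaps idea using SSE2 intrinsics rather than raw assembly; `zero_sse` is a made-up name, and this assumes an SSE2-capable compiler and target:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stddef.h>

    /* byte stores until 16-byte aligned, then one aligned 16-byte store
       (movdqa, the integer twin of movaps) per iteration, then a byte tail */
    static void zero_sse(void *buff, size_t size)
    {
        unsigned char *p = buff;
        __m128i zero = _mm_setzero_si128();

        while (size > 0 && ((uintptr_t)p & 15) != 0) {  /* head */
            *p++ = 0;
            size--;
        }
        for (; size >= 16; size -= 16, p += 16)         /* aligned body */
            _mm_store_si128((__m128i *)p, zero);
        while (size) {                                  /* tail */
            *p++ = 0;
            size--;
        }
    }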

Tim
30

memset is generally designed to be very very fast general-purpose setting/zeroing code. It handles all cases of different sizes and alignments, which affect the kinds of instructions you can use to do the work. Depending on what system you're on (and which vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture, to take advantage of its native properties. It might also have internal special cases for zeroing (versus setting some other value).

That said, if you have very specific, very performance critical memory zeroing to do, it's certainly possible that you could beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)

Ben Zotto
  • Also: memset could in theory have a special case for 0 which is selected at compile-time (either by inlining or as an intrinsic operation) when that argument is a literal. Don't know whether anyone does or not. – Steve Jessop Sep 06 '10 at 23:58
  • @Steve Jessop: Interesting idea (esp that it could be compile-time). I remember reading someone's maverick implementation of memset once that had special cases for just about everything you'd actually use memset for. – Ben Zotto Sep 07 '10 at 00:05
  • gcc typically uses an inline builtin implementation of `memset()`. Funnily enough, I remember reading about a buggy implementation of `memset()` that always set the value to 0 - and this wasn't noticed for *years*, because apparently the vast majority of the time `memset()` is used to set to zero! – caf Sep 07 '10 at 00:50
  • *"`memset` is generally designed to be very very fast general-purpose setting/zeroing code..."* - I don't think that's quite correct. There are no guarantees `memset` will survive the optimization pass, so zeroization may not occur. `memset_s` provides that guarantee, but the Glibc folks refuse to provide it. Also see [Issue 17879: Library is missing memset_s](https://sourceware.org/bugzilla/show_bug.cgi?id=17879). – jww Feb 27 '16 at 13:41
24

Nowadays your compiler should do all the work for you. As far as I know, gcc is very efficient at optimizing memset calls away (better check the assembler, though).

Also, avoid memset if you don't have to:

  • use calloc for heap memory
  • use proper initialization (... = { 0 }) for stack memory

And for really large chunks, use mmap if you have it. That just gets zero-initialized memory from the system "for free".
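
A small sketch of all three, assuming a POSIX system for mmap (MAP_ANONYMOUS is a Linux/BSD extension; older systems spell it MAP_ANON) and reusing SIZE from the question's test program:

    #include <stdlib.h>
    #include <sys/mman.h>

    #define SIZE 0x1000000

    int demo(void)
    {
        /* heap: calloc hands back zeroed memory */
        char *heap = calloc(SIZE, 1);

        /* stack: aggregate initialization zeroes the whole object */
        struct { int id; char name[32]; } obj = { 0 };

        /* large chunks: anonymous mmap pages are zero-filled by the kernel */
        char *big = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (big != MAP_FAILED) munmap(big, SIZE);
        free(heap);
        return obj.id;  /* keep obj alive */
    }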

Jens Gustedt
  • No, last time I checked, gcc didn't. However, g++ does optimize away a call to `std::fill` *(unless the optimization `-ftree-loop-distribute-patterns` is enabled, in which case it also becomes a call to memset)*, which is the C++ analog of memset. – Hi-Angel Sep 26 '15 at 16:08
  • Perhaps worth a mention: I just ran some tests, and found a wonderful thing: with `-ftree-loop-distribute-patterns`, which changes `std::fill` into `memset`, the program is 10× (!) faster than without, i.e. than when `std::fill` is inlined by g++, even if I add `-march=native`. Hence gcc-4.9.2 isn't that good at optimizing, because that means there is a way to optimize `std::fill` even more. Btw, I also ran a test with clang, and found that it optimizes worse: at `-O3` it doesn't even remove the push/pop from the code. – Hi-Angel Sep 26 '15 at 17:12
  • Well, the "for free" comment isn't totally true. The initialization of memory to zero simply happens in the OS rather than under your program's control. Who knows who wrote the mmap function, and is it efficient? If timing is critical, it may be better to get the uninitialized memory and then clear it yourself with an assembler routine. – Cool Javelin Feb 25 '17 at 01:44
  • The compiler provides intrinsics for most basic functions, including this. Library functions are expected to use the intrinsics from the compiler, when available. – Igor Stoppa Sep 18 '18 at 13:55
6

If I remember correctly (from a couple of years ago), one of the senior developers was talking about a fast way to bzero() on PowerPC (the specs said we needed to zero almost all the memory on power-up). It might not translate well (if at all) to x86, but it could be worth exploring.

The idea was to load a data cache line, clear that data cache line, and then write the cleared data cache line back to memory.
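
To make the idea concrete, here is a hypothetical sketch with GCC inline assembly for PowerPC; dcbz establishes a zeroed cache line in the data cache without first reading it from memory, and the line is written back later by the normal cache mechanisms. The 32-byte line size is an assumption (other PowerPC cores use 64 or 128 bytes), and both pointer and size must be line-aligned:

    #include <stddef.h>

    #define CACHE_LINE 32  /* assumed; query the real line size at runtime */

    static void zero_dcbz(void *buff, size_t size)
    {
        char *p = buff;
        char *end = p + size;
        /* one dcbz per cache line */
        for (; p < end; p += CACHE_LINE)
            __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory");
    }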

For what it is worth, I hope it helps.

Sparky
6

Unless you have specific needs or know that your compiler/stdlib is sucky, stick with memset. It's general-purpose and should have decent performance in general. Also, compilers may have an easier time optimizing/inlining memset() because they can have intrinsic support for it.

For instance, Visual C++ will often generate inline versions of memcpy/memset that are as small as a call to the library function, thus avoiding push/call/ret overhead. And there are further possible optimizations when the size parameter can be evaluated at compile time.

That said, if you have specific needs (where the size will always be tiny *or* huge), you can gain speed boosts by dropping down to assembly level. For instance, using write-through operations to zero huge chunks of memory without polluting your L2 cache.
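
As a rough sketch of that last idea, SSE2 non-temporal stores (movntdq) write around the cache entirely. This assumes buff is 16-byte aligned and size is a multiple of 16; a real version would still need head/tail handling:

    #include <emmintrin.h>  /* SSE2 */
    #include <stddef.h>

    static void zero_stream(void *buff, size_t size)
    {
        __m128i zero = _mm_setzero_si128();
        char *p = buff;
        size_t i;

        for (i = 0; i < size; i += 16)      /* stores bypass the cache */
            _mm_stream_si128((__m128i *)(p + i), zero);
        _mm_sfence();  /* make the streaming stores globally visible */
    }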

But it all depends - and for normal stuff, please stick to memset/memcpy :)

snemarch
  • Even old gcc implementations on sparc replaced `memcpy` and `memset` calls with `mov` instructions when the sizes were known at compile time and not too big. – Patrick Schlüter Jul 11 '11 at 13:41
5

The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations, it is a simple while loop that copies the specified value one byte at a time over the given number of bytes. If you want a faster memset (or memcpy, memmove, etc.), it is almost always possible to code one up yourself.

The simplest customization is to do single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture) and then store a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.
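
A plain-C sketch of that technique (essentially the question's zero_sizet plus the alignment head; a production version would also worry about strict aliasing):

    #include <stddef.h>
    #include <stdint.h>

    static void zero_aligned(void *buff, size_t size)
    {
        unsigned char *p = buff;

        while (size && ((uintptr_t)p % sizeof(size_t)) != 0) {    /* head */
            *p++ = 0;
            size--;
        }
        for (; size >= sizeof(size_t); size -= sizeof(size_t)) {  /* body */
            *(size_t *)p = 0;
            p += sizeof(size_t);
        }
        while (size) {                                            /* tail */
            *p++ = 0;
            size--;
        }
    }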

Depending on your particular CPU, you might also have some streaming SIMD instructions that can help you out. These will typically work better on aligned addresses, so the above technique for using aligned addresses can be useful here as well.

For zeroing out large sections of memory, you may also see a speed boost by splitting the range into sections and processing each section in parallel (where the number of sections is the same as your number of cores/hardware threads).
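
A sketch of that parallel variant with pthreads; NTHREADS is an assumption to match to your core count, and on most desktops memory bandwidth, not the CPU, will be the real limit:

    #include <pthread.h>
    #include <string.h>
    #include <stddef.h>

    #define NTHREADS 4  /* assumed; match your core count */

    struct chunk { char *ptr; size_t len; };

    static void *zero_worker(void *arg)
    {
        struct chunk *c = arg;
        memset(c->ptr, 0, c->len);
        return NULL;
    }

    static void zero_parallel(void *buff, size_t size)
    {
        pthread_t tid[NTHREADS];
        struct chunk part[NTHREADS];
        size_t per = size / NTHREADS;
        int i;

        for (i = 0; i < NTHREADS; i++) {
            part[i].ptr = (char *)buff + (size_t)i * per;
            part[i].len = (i == NTHREADS - 1) ? size - (size_t)i * per : per;
            pthread_create(&tid[i], NULL, zero_worker, &part[i]);
        }
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
    }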

Most importantly, there's no way to tell if any of this will help unless you try it. At a minimum, take a look at what your compiler emits for each case. See what other compilers emit for their standard 'memset' as well (their implementation might be more efficient than your compiler's).

bta
3

That's an interesting question. I made this implementation that is just slightly faster (but hardly measurably so) when compiling a 32-bit release build on VC++ 2012. It probably can be improved on a lot. Adding this in your own class in a multithreaded environment would probably give you even more performance gains, since there are some reported bottleneck problems with memset() in multithreaded scenarios.

// MemsetSpeedTest.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>
#include "Windows.h"
#include <time.h>

#pragma comment(lib, "Winmm.lib") 
using namespace std;

/** a signed 64-bit integer value type */
#define _INT64 __int64

/** a signed 32-bit integer value type */
#define _INT32 __int32

/** a signed 16-bit integer value type */
#define _INT16 __int16

/** a signed 8-bit integer value type */
#define _INT8 __int8

/** an unsigned 64-bit integer value type */
#define _UINT64 unsigned _INT64

/** an unsigned 32-bit integer value type */
#define _UINT32 unsigned _INT32

/** an unsigned 16-bit integer value type */
#define _UINT16 unsigned _INT16

/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8

/** maximum allowed value in an unsigned 64-bit integer value type */
#define _UINT64_MAX 18446744073709551615ULL

#ifdef _WIN32

/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);

/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);

/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock */
#define TIMER_INIT clock_t start;double diff;

/** Use to start the performance timer */
#define TIMER_START start=clock();

/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif    


void *MemSet(void *dest, _UINT8 c, size_t count)
{
    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);
    _UINT64 cUll = 
        c 
        | (((_UINT64)c) << 8 )
        | (((_UINT64)c) << 16 )
        | (((_UINT64)c) << 24 )
        | (((_UINT64)c) << 32 )
        | (((_UINT64)c) << 40 )
        | (((_UINT64)c) << 48 )
        | (((_UINT64)c) << 56 );

    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = cUll;

    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);

    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = (_UINT32)cUll;

    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);

    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = (_UINT16)cUll;

    if (!bytesLeft) return dest;

    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = (_UINT8)cUll;

    return dest;
}

int _tmain(int argc, _TCHAR* argv[])
{
    TIMER_INIT

    const size_t n = 10000000;
    const _UINT64 m = _UINT64_MAX;
    const _UINT64 o = 1;
    char *test = new char[n];  // heap: a 10 MB array would overflow the default stack
    {
        cout << "memset()" << endl;
        TIMER_START;

        for (int i = 0; i < m ; i++)
            for (int j = 0; j < o ; j++)
                memset((void*)test, 0, n);  

        TIMER_STOP;
    }
    {
        cout << "MemSet() took:" << endl;
        TIMER_START;

        for (int i = 0; i < m ; i++)
            for (int j = 0; j < o ; j++)
                MemSet((void*)test, 0, n);

        TIMER_STOP;
    }

    cout << "Done" << endl;
    int wait;
    cin >> wait;
    delete[] test;
    return 0;
}

Output is as follows when release compiling for 32-bit systems:

memset() took:
5.569000
MemSet() took:
5.544000
Done

Output is as follows when release compiling for 64-bit systems:

memset() took:
2.781000
MemSet() took:
2.765000
Done

Here you can find the source code of Berkeley's memset(), which I think is the most common implementation.

  • You should really be using the `QueryPerformanceCounter()` family of functions. – quantum Mar 13 '13 at 00:03
  • @xiaomao Those functions only give you finer-than-millisecond accuracy, but I'll humor you. On 32-bit Windows the results were: `memset() took: 5575.6 ms. MemSet() took: 5539.97 ms.`, and on 64-bit Windows the results were: `memset() took: 2778.73 ms. MemSet() took: 2774.18 ms.` –  Mar 13 '13 at 10:56
  • `const _UINT64 m = _UINT64_MAX` looks suspicious, since you later compare it to an `int`; `j < o` should be optimized away by a good compiler; and you're being unfair to `memset()`, since it is also what warms up the array `test`. – tc. May 14 '13 at 15:15
  • @tc. "what warms up the array", yes, would be interesting to run tests where the two methods were alternating multiple times in a single runtime. I could possibly look at it when I get home. – Max Sep 02 '20 at 13:30
3

There is one fatal flaw in this otherwise great and helpful test: as memset is the first timed call, there seems to be some "memory overhead" (most likely the page faults from touching the freshly allocated buffer for the first time) which makes it look extremely slow. Moving the timing of memset to second place, putting something else in first place, or simply timing memset twice makes memset the fastest with all compile switches!
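
One way to apply that fix to the question's test program is an untimed warm-up pass inside main(), before the first measurement (a sketch, reusing the question's buffer, SIZE and TIME):

    /* touch every page once so the first timed candidate doesn't
       pay for the page faults of the freshly malloc'd buffer */
    memset(buffer, 0, SIZE);
    TIME(
        memset(buffer, 0, SIZE);
    );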

Chris
  • Thanks for pointing that out. I originally ran the test on an Atom, but now I only have access to a PPC. At least on this machine I can confirm what you say. My guess is that the cache is doing its magic. I guess I have to move to assembler now. – maep Aug 19 '11 at 14:05
-1

memset can be inlined by the compiler as a series of efficient opcodes, unrolled for a few iterations. For very large memory blocks, like a 4000x2000 64-bit framebuffer, you can try optimizing it across several threads (which you prepare for that sole task), each setting its own part. Note that there is also bzero(), but it is more obscure and less likely to be as well optimized as memset, and the compiler will surely notice you pass 0.

What the compiler usually assumes is that you memset large blocks, so if you initialize a large number of small objects, it would likely be more efficient to just do *(uint64_t*)p = 0 for each.
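
For instance (a sketch with a made-up 8-byte type; note that the cast formally violates strict aliasing for most types, so memcpy, which compilers turn into the same single store, is the safe spelling):

    #include <stdint.h>

    struct vec2 { float x, y; };  /* 8 bytes, assumed suitably aligned */

    static void clear_vec2(struct vec2 *v)
    {
        *(uint64_t *)v = 0;  /* one 64-bit store instead of a memset call */
    }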

Generally, all x86 CPUs are different (unless you compile for some standardized platform), and something you optimize for a Pentium II will behave differently on a Core Duo or an i486. So if you are really into it and want to squeeze out the last few bits of toothpaste, it makes sense to ship several versions of your exe, compiled and optimized for different popular CPU models. From personal experience: Clang with -march=native boosted my game's FPS from 60 to 65, compared to no -march at all.