59

There are two ways to zero out an integer/float array:

memset(array, 0, sizeof(int)*arraysize);

or:

for (int i = 0; i < arraysize; ++i)
    array[i] = 0;

Obviously, memset is faster for a large arraysize. However, at what point does the overhead of memset actually become larger than the overhead of the for loop? For example, for an array of size 5 - which would be best? The first, the second, or maybe even the unrolled version:

array[0] = 0;
array[1] = 0;
array[2] = 0;
array[3] = 0;
array[4] = 0;
Claudiu
  • 4
    well I can't really see the interesting part. The only real answer will come from benchmarking the three versions on a given platform/compiler. Moreover, knowing the answer is not really useful, no? But for the sake of writing the benchmark code efficiently ;) – neuro Jul 15 '09 at 21:17
  • yea, I plan on doing this myself. but now the whole internet will know! (at least for Windows with gcc from MinGW...) – Claudiu Jul 15 '09 at 21:23
  • I'm just wondering how you came to that question? – Artem Barger Jul 15 '09 at 21:30
  • I'm working with someone else's code which uses for loops everywhere. I'm trying to optimize it, so I'm converting them to memsets. I was wondering if that's the right thing to do – Claudiu Jul 15 '09 at 21:38
  • 7
    You didn't mention the third version: `int array[5] = {0};` – Hosam Aly Jul 15 '09 at 22:13
  • 6
    That's only for initializing at the declaration site. The forms in the question can be used at any time to clear an existing array. – nobody Jul 16 '09 at 01:29

4 Answers

49

In all likelihood, memset() will be inlined by your compiler (most compilers treat it as an 'intrinsic', which basically means it's inlined, except maybe at the lowest optimizations or unless explicitly disabled).

For example, here are some release notes from GCC 4.3:

Code generation of block move (memcpy) and block set (memset) was rewritten. GCC can now pick the best algorithm (loop, unrolled loop, instruction with rep prefix or a library call) based on the size of the block being copied and the CPU being optimized for. A new option -minline-stringops-dynamically has been added. With this option string operations of unknown size are expanded such that small blocks are copied by in-line code, while for large blocks a library call is used. This results in faster code than -minline-all-stringops when the library implementation is capable of using cache hierarchy hints. The heuristic choosing the particular algorithm can be overwritten via -mstringop-strategy. Newly also memset of values different from 0 is inlined.

It might be possible for the compiler to do something similar with the alternative examples you gave, but I'd bet it's less likely to.

And memset() is grep-able, and the intent is more immediately obvious at a glance to boot (not that the loop is particularly difficult to grok either).

Sinan Ünür
Michael Burr
  • 10
    This is a great example of the often-heard "the compiler is better than you at optimizing". Few application programmers would spend this amount of attention on a single call (and if they did, their actual application would likely suffer). :) – unwind Jul 16 '09 at 08:59
  • 2
    unwind, if you think 'the compiler is better than you' is something that you should follow, check this out - http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/ – LiraNuna Jul 30 '09 at 21:55
  • 1
    @LiraNuna - that's quite an interesting article, but I think you'd agree that memset()/memcpy() are in a different class than SSE intrinsics in terms of how much work has gone into compiler code generation. Also, you'd only want to do the level of analysis that you did on code that's truly performance critical (or maybe as an academic exercise), and therefore worthy of your in-depth attention - not for each and every buffer copy or clear. – Michael Burr Jul 30 '09 at 22:44
  • 1
    @LiraNuna: update 6 years later: clang currently has more powerful SIMD intrinsics optimization than gcc. especially for shuffles, clang doesn't even care what intrinsics you used in the first place; it just makes its own choice what instructions to emit to achieve the same shuffle result. This can actually [be detrimental when you try to hand-tune a sequence and clang's current choices are imperfect](http://stackoverflow.com/a/35270026/224132), but in most cases it's great. (And in future, when clang/LLVM's optimizer knows the optimal shuffles for more cases, it won't be a problem.) – Peter Cordes Jul 10 '16 at 17:14
24

As Michael already noted, gcc (and, I'd guess, most other compilers) already optimizes this very well. For example, gcc turns this

char arr[5];
memset(arr, 0, sizeof arr);

into

movl  $0x0, <arr+0x0>
movb  $0x0, <arr+0x4>

It doesn't get any better than that...

bdijkstra
  • for slightly bigger arrays, 64-bit registers or SIMD can be used to zero out 8/16/32... bytes at a time – phuclv Feb 25 '17 at 09:00
9

There's no way of answering the question without measuring. It will depend entirely on the compiler, CPU and runtime library implementations.

memset() can be a bit of a "code smell", because it can be prone to buffer overflows and parameter reversals, and has the unfortunate ability of only clearing 'byte-wise'. However, it's a safe bet that it will be 'fastest' in all but extreme cases.

I tend to use a macro to wrap this to avoid some of the issues:

#define CLEAR(s) memset(&(s), 0, sizeof(s))

This sidesteps the size calculations and removes the problem of swapping the length and value parameters.

In short, use memset() "under the hood". Write what you intend, and let the compiler worry about optimizations. Most are incredibly good at it.

Roddy
  • ++ Ohmygosh! U use a macro!? Better go into hiding! – Mike Dunlavey Jul 16 '09 at 00:44
  • 1
    I think you made the macro parameter (x), but used (s) in the actual body... might want to edit that. – micmoo Jul 16 '09 at 01:19
  • @micmoo - thanks - fixed. @mike re Macros: yes... however they're unavoidable in C. The C++ answer to this question would be *very* different! – Roddy Jul 16 '09 at 08:41
  • @Roddy: BTW that was sarcasm :) Regarding the current macros-are-bad religion, I'm a devout skeptic. – Mike Dunlavey Jul 17 '09 at 18:26
  • 8
    Your macro would make it hard to spot the problem with code such as this: `char* foo = malloc(80); CLEAR(foo);` – Dan Breslau Jul 30 '09 at 21:44
  • @Dan, Well, I don't see that as any harder to spot than `char* foo = malloc(80); memset(&foo, 0, 80);` or `memset(foo, 80, 0);`. – Roddy Jul 05 '12 at 08:34
  • 1
    @Roddy, there's certainly some room for subjectivity in this discussion. In my experience, a macro call like this adds more obfuscation than it's worth. An experienced programmer would likely have an immediate double-take when reading the `memset` calls that you just wrote. But seeing the error in `char* foo = malloc(80); CLEAR(foo);` requires knowing the definition of `CLEAR`. (Arguably, an experienced programmer would know immediately that no reasonable definition of `CLEAR` could be correct for a dynamically-allocated pointer; but that's not the kind of experience I'd want to rely on.) – Dan Breslau Jul 05 '12 at 15:25
1

Considering this code on its own, everything has already been said. But if you consider it in the context of its program, about which I know nothing, something else can be done. For example, if this code must be executed every so often to clear an array, you could run a thread that constantly allocates a new array of zero elements and assigns it to a global variable, which your code, when it needs the array cleared, simply points to.

This is a third option. Of course, it only makes sense if you plan to run your code on a processor with at least two cores. Also, the code must run more than once for the benefits to show. For a one-time run, you could declare an array filled with zeros and then point to it when needed.

Hope this helps someone.

fedeb
  • This is only a good idea for quite large arrays. The cache-miss overhead from zeroing on one core and using on another core, not to mention synchronisation overhead, will be killer for small arrays. I'd guess that a separate zeroing thread is only worth considering for arrays bigger than 16kiB. (L1 cache size in modern Intel CPUs is 32kiB) – Peter Cordes Jul 10 '16 at 17:20
  • Freeing old arrays and allocating new zeroed memory like you suggest might possibly be cheaper, especially if most of the pages won't be written. On Linux, for example, newly-allocated pages from mmap are all copy-on-write mapped to the same physical page of zeroed memory. `calloc(3)` takes advantage of that, and doesn't dirty the pages (unlike `std::vector`). The first write to every page will trigger a soft page fault. – Peter Cordes Jul 10 '16 at 17:23
  • I thought every core had its own L1 cache, so I can't see where the miss would occur. You may know more than me about caches though. Synchronisation overhead shouldn't be a problem if the array isn't cleared too often. While the clearing thread holds the lock, the other thread would be doing something else. The real drawback is that you need to know really well how often the array needs to be cleared in the "main" thread, which isn't always possible I guess. If it isn't possible, my idea won't work and will slow everything down a lot – fedeb Jul 11 '16 at 15:31
  • Every core has private L1, and that's exactly the problem: The array will be hot in the L1 cache on the core running the cleaning thread, NOT the core that tries to use it. `memset` of a small array in the same thread that uses it leaves that memory hot in L1, so subsequent writes are fast. If it was last written by another thread, it will at best be hot in the shared L3 cache (Intel CPUs since Nehalem use large inclusive L3 caches so coherency traffic doesn't have to go through main memory). At worst, it's still in Modified state in the other core's L1 (MOESI). – Peter Cordes Jul 11 '16 at 17:41
  • For large arrays, like a couple MiB or something, it might make a bit of sense to maintain a pool of zeroed arrays. IDK if you're gaining much over letting the OS do that for you, though, and using `calloc`. For smaller arrays, `memset` is very cheap. If you're going to touch the array right after zeroing, you should probably do both in the same thread. – Peter Cordes Jul 11 '16 at 17:45
  • nice caching explanation – fedeb Dec 04 '16 at 21:49