Is memset() more efficient than for loop in C?

Question

Is memset() more efficient than for loop.

Considering this code:

char x[500];
memset(x,0,sizeof(x));

And this one:

char x[500];
for(int i = 0 ; i < 500 ; i ++) x[i] = 0;

Which one is more efficient and why? Is there any special instruction in hardware to do block level initialization.

Yes. No. Maybe. It depends. The only way to get a useful answer is to analyze and profile it *in your environment*. Which one is faster on my compiler, in my program, on my computer, tells you nothing useful. — Robᵩ, Sep 09 '11 at 21:34
Compile both and benchmark them. The answer depends on what kind of computer you're using, what compiler you use, what standard library you use, the size of the block you're trying to change, the phase of the moon... — Chris Lutz, Sep 09 '11 at 21:34
@Chris: You likely don't need to even go that far. Just peeking at the assembly output should be sufficient. — Ed S., Sep 09 '11 at 21:53
Why bother investigating? Unless there is data to show otherwise (you are failing your perf goals and investigation points to this section of code), this piece of code is likely not a hotspot, and you should just go for as simple, readable, and maintainable code possible. — Michael, Sep 09 '11 at 21:54
@Ed S. - Not everyone can read assembly. Everyone can read numbers. (Well, everyone who programs.) — Chris Lutz, Sep 09 '11 at 21:55
If you have a compiler that doesn't substitute that loop with memset() then you should find another compiler. — Hans Passant, Sep 09 '11 at 22:00
@Chris: Ummmm.... they should probably learn then. I guess I'm a 27 year old dinosaur, but I have a problem with so-called "engineers" who can't read basic assembly... I don't mean to imply that one shouldn't use a profiler, but for such a trivial comparison it should be unnecessary. — Ed S., Sep 09 '11 at 22:07
@Ed S. - If you program in C extensively, then I agree. If you're a web programmer who doesn't really need to work with anything lower-level than Python, meh. I'll never say "you don't need to learn something," but in some contexts knowledge of assembly may not be very useful (and in some contexts knowledge of the operating system's internal workings may be _more_ useful). — Chris Lutz, Sep 09 '11 at 22:09
@Chris: And that is why so many web guys (and gals) that I have come into contact with write applications that are much slower than they should be. Not necessarily because they can't read assembly, but because they never really learned the performance characteristics of the data structures they use and how their high level code may be executed when it is turned into machine code. I digress though, that's a discussion for another place and time. — Ed S., Sep 09 '11 at 22:11
@Ed S. - I don't disagree with you - I'd love it if Notch learned C/assembly/CS theory/whatever and made Minecraft run at a consistent speed. — Chris Lutz, Sep 09 '11 at 22:19
And if you only need to do it once, do it at the definition: `char x[500] = {0};`, which won't have any effect on running speed but makes the code look nicer to me. — pmg, Dec 06 '18 at 16:09

Diego Sevilla · Accepted Answer · 2019-02-18T09:20:09.043

49

Most certainly, memset will be much faster than that loop. Note how you treat one character at a time, but those functions are so optimized that set several bytes at a time, even using, when available, MMX and SSE instructions.

I think the paradigmatic example of these optimizations, that go unnoticed usually, is the GNU C library strlen function. One would think that it has at least O(n) performance, but it actually has O(n/4) or O(n/8) depending on the architecture (yes, I know, in big O() will be the same, but you actually get an eighth of the time). How? Tricky, but nicely: strlen.

edited Feb 18 '19 at 09:20

answered Sep 09 '11 at 21:37

Diego Sevilla

28,636
4
59
87

6

Any optimizing compiler will replace the for loop with an optimal sequence (which may be a call to memset). – Stephen Canon Sep 09 '11 at 21:42
Also, for it to be "much faster" is not guaranteed, even if the compiler emits sub-optimal code for the loop. 500 is not really that high of a number, and if a soft or hard page fault is incurred, that will greatly outweigh the cost of the loop itself. – Michael Sep 09 '11 at 21:45
2

@Stephen Canon: Heh. I was compiling a C library with clang/LLVM and it replaced the library's memset for loop with a call to memset. Oops! Deep recursion. – Richard Pennington Sep 09 '11 at 21:51
@Michael: those faults are out of the question. Of course, they can happen, but when the performance of making, say, 500/8 assigns is slower than doing 500 assignations to 0? Also, I think the OP used 500 as an example. – Diego Sevilla Sep 09 '11 at 21:55
1

@Diego it's not a question of 500/8 assigns being slower or faster than 500 assigns to 0. Micro-benchmarks like this are rarely useful due to other effects in the system. On a modern processor, the difference between only the comparisons will probably be on the order of 62 cycles versus 500 cycles. What I'm suggesting is if you incur a hard page fault on the order of 10 million cycles while running the code, the 438 cycles you saved are just noise. – Michael Sep 09 '11 at 22:00
Also, unless this chunk of code executes hundreds of thousands of times each second, a 438 cycle difference will not be user perceptible and you're wasting your time optimizing non-issues. – Michael Sep 09 '11 at 22:01
@Michael: then why micro-optimize? :) – Diego Sevilla Sep 09 '11 at 22:02
@Diego: It sounds like we are in agreement :) Micro-optimizations are best left to the compiler which can apply them wholesale to your entire codebase. Hand micro-optimization is appropriate when you've identifying a piece of code as greatly impacting your performance and you cannot identify any algorithmic improvements. My suggestion is to stick to memset since it is one line of code versus 4, which I think is more readable. – Michael Sep 09 '11 at 22:04
@Michael: I agree with you in some points... Anyway, I was trying to be sarcastic... :) I agree in one line vs. four lines. Also, I'd say use it because this is what it is there for. *But* micro-optimizations are important, even at this level. Yes, here we have 500 elements, and a function that maybe in this program is called once. But what if afterwards that function is used as a part of other computation and get called one million times? Then those 400 cycles are 400 seconds... – Diego Sevilla Sep 09 '11 at 22:08
2

@Richard Pennington: `-fno-builtin-memset`. – Stephen Canon Sep 09 '11 at 22:30
It is wrong. O is the worst performance, and that is O(n) – consider string of length 1... Ω is the best performance and it is Ω(n/8). That is for clarity. :) – Velda Oct 23 '15 at 16:54
Sometimes O() is used instead of o() (lower case sigma). But we all understand here that the difference between both is the constants, that are not considered in the big O. – Diego Sevilla Oct 24 '15 at 08:52
Is it somehow possible that, multiple processor split memset among different processors... any idea ? – vicky Nov 05 '15 at 11:57
The real bottleneck is memory access... So multiple processors won't make it faster. Alas, you have to coordinate them, and memory copying is a somewhat fast operation compared with coordination. – Diego Sevilla Nov 05 '15 at 17:43
Link is broken. What are you trying to illustrate? – SOFe Feb 16 '19 at 12:41
@SOFe It's fixed now. The implementation given uses a trick to read null-terminated strings 4 or 8 bytes at a time and can use a simple function to determine if any of those bytes contains a null terminator. That way it doesn't have to check each byte individually. – puppydrum64 Nov 21 '22 at 15:44

Ed S. · Answer 2 · 2011-09-09T22:23:05.337

40

Well, why don't we take a look at the generated assembly code, full optimization under VS 2010.

char x[500];
char y[500];
int i;      

memset(x, 0, sizeof(x) );   
  003A1014  push        1F4h  
  003A1019  lea         eax,[ebp-1F8h]  
  003A101F  push        0  
  003A1021  push        eax  
  003A1022  call        memset (3A1844h)

And your loop...

char x[500];
char y[500];
int i;    

for( i = 0; i < 500; ++i )
{
    x[i] = 0;

      00E81014  push        1F4h  
      00E81019  lea         eax,[ebp-1F8h]  
      00E8101F  push        0  
      00E81021  push        eax  
      00E81022  call        memset (0E81844h)  

      /* note that this is *replacing* the loop, 
         not being called once for each iteration. */
}

So, under this compiler, the generated code is exactly the same. memset is fast, and the compiler is smart enough to know that you are doing the same thing as calling memset once anyway, so it does it for you.

If the compiler actually left the loop as-is then it would likely be slower as you can set more than one byte size block at a time (i.e., you could unroll your loop a bit at a minimum. You can assume that memset will be at least as fast as a naive implementation such as the loop. Try it under a debug build and you will notice that the loop is not replaced.

That said, it depends on what the compiler does for you. Looking at the disassembly is always a good way to know exactly what is going on.

edited Sep 09 '11 at 22:23

answered Sep 09 '11 at 21:45

Ed S.

122,712
22
185
265

Interesting, my version did not result in the loop being converted to memset, but that is probably because for my test, the loop operated on a global (otherwise the entire loop was removed as being unnecessary.) – Michael Sep 09 '11 at 21:53
@Michael: I added a couple of calls to `printf` using `x` and `y` to ensure that they weren't completely optimized away as they are unused. It is of course compiler and platform dependent to some degree, but any half way decent optimizing compiler should get rid of the loop with optimizations turned on. – Ed S. Sep 09 '11 at 22:06
even memset() vs array initialization (Ex: a[n]={0}) takes same code it looks. Advantage of of memset is that array size can be a variable, which is not possible with initialization. Am I right? – Rajesh Jun 21 '20 at 12:39
How do you inspect the "disassembly". – young_souvlaki Oct 22 '20 at 03:36
@young_souvlaki: In VS? https://learn.microsoft.com/en-us/visualstudio/debugger/how-to-use-the-disassembly-window?view=vs-2019 – Ed S. Nov 13 '20 at 17:29

score 13 · Answer 3 · answered Sep 09 '11 at 21:35

It really depends on the compiler and library. For older compilers or simple compilers, memset may be implemented in a library and would not perform better than a custom loop.

For nearly all compilers that are worth using, memset is an intrinsic function and the compiler will generate optimized, inline code for it.

Others have suggested profiling and comparing, but I wouldn't bother. Just use memset. Code is simple and easy to understand. Don't worry about it until your benchmarks tell you this part of code is a performance hotspot.

Bobby Powers · Answer 4 · 2011-09-09T21:44:59.453

The answer is 'it depends'. memset MAY be more efficient, or it may internally use a for loop. I can't think of a case where memset will be less efficient. In this case, it may turn into a more efficient for loop: your loop iterates 500 times setting a bytes worth of the array to 0 every time. On a 64 bit machine, you could loop through, setting 8 bytes (a long long) at a time, which would be almost 8 times quicker, and just dealing with the remaining 4 bytes (500%8) at the end.

EDIT:

in fact, this is what memset does in glibc:

http://repo.or.cz/w/glibc.git/blob/HEAD:/string/memset.c

As Michael pointed out, in certain cases (where the array length is known at compile time), the C compiler can inline memset, getting rid of the overhead of the function call. Glibc also has assembly optimized versions of memset for most major platforms, like amd64:

http://repo.or.cz/w/glibc.git/blob/HEAD:/sysdeps/x86_64/memset.S

I can think of a situation where `memset` will be less efficient: a compiler that can't inline (well). — orlp, Sep 09 '11 at 21:53
I imagine if both the second two arguments were known at compile time most humans would be hard pressed to equal the compiler generated code. — Chris Lutz, Sep 09 '11 at 21:54

score 3 · Answer 5 · answered Sep 09 '11 at 21:38

3

Good compilers will recognize the for loop and replace it with either an optimal inline sequence or a call to memset. They will also replace memset with an optimal inline sequence when the buffer size is small.

In practice, with an optimizing compiler the generated code (and therefore performance) will be identical.

answered Sep 09 '11 at 21:38

Stephen Canon

103,815
19
183
269

Can you provide a citation? – Translunar Dec 10 '15 at 16:24
Try it and see with any good optimizing compiler (e.g. http://goo.gl/2mWsxq). I'm not sure what there is to "cite" here. – Stephen Canon Dec 10 '15 at 17:50
Academic citations are always important, even if just grey literature. – Patrick Jan 18 '20 at 11:08

score 2 · Answer 6 · answered Sep 09 '11 at 21:40

2

Agree with above. It depends. But, for sure memset is faster or equal to the for-loop. If you are uncertain of your environment or too lazy to test, take the safe route and go with memset.

answered Sep 09 '11 at 21:40

beetree

871
2
8
18

Please specify which one you mean by "above." – Translunar Dec 10 '15 at 16:25

Rachid K. · Answer 7 · 2021-02-11T19:16:34.423

Other techniques like loop unrolling which reduce the number of loops can also be used. The code of memset() can mimic the famous duff's device:

void *duff_memset(char *to, int c, size_t count)
{
    size_t n;
    char *p = to;
    n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *p++ = c;
    case 7:      *p++ = c;
    case 6:      *p++ = c;
    case 5:      *p++ = c;
    case 4:      *p++ = c;
    case 3:      *p++ = c;
    case 2:      *p++ = c;
    case 1:      *p++ = c;
            } while (--n > 0);
    }
    return to;
}

Those tricks used to enhancing the execution speed in the past. But on modern architectures this tends to increase the code size and increase cache misses.

So, it is quite impossible to say which implementation is faster as it depends on the quality of the compiler optimizations, the ability of the C library to take advantage of special hardware instructions, the amount of data you are operating on and the features of the underlying operating system (page faults management, TLB misses, Copy-On-Write).

For example, in the glibc, the implementation of memset() as well as various other "copy/set" functions like bzero() or strcpy() are architecture dependent to take advantage of various optimized hardware instructions like SSE or AVX.

Is memset() more efficient than for loop in C?

7 Answers7

Linked