I need a simple ZeroMemory implementation using SSE (SSE2 preferred). Can someone help with that? I searched through SO and the net but did not find a direct answer.
-
Which ZeroMemory? SSE isn't intended for that. – GJ. Oct 08 '12 at 18:01
-
I meant filling a block of memory (about 1 MB) with zeroes (as fast as possible) – grunge fightr Oct 08 '12 at 18:07
-
Just look in the source of Agner Fog's optimized subroutine library. Spoiler: nontemporal stores only if the data is big compared to the cache size, offsets from the *end* of the block (negative index), and unconditionally do the fixup at the beginning and end. – harold Oct 08 '12 at 21:18
-
@GJ. Actually `memset` etc. from gnu libc make heavy use of SSE instructions for exactly that kind of operation. Specifically `movntdq` and similar instructions are intended for exactly this kind of stuff. – Gunther Piez Oct 08 '12 at 23:05
-
@harold - I got the same speed with mov and movdqa, so that is probably the maximum my RAM chips can do and further effort will give no speedup – grunge fightr Oct 09 '12 at 06:36
3 Answers
Is `ZeroMemory()` or `memset()` not good enough?
Disclaimer: Some of the following may be SSE3.
- Fill any unaligned leading bytes by looping until the address is a multiple of 16
- `push` to save an xmm reg
- `pxor` to zero the xmm reg
- While the remaining length >= 16, `movdqa` or `movntdq` to do the write
- `pop` to restore the xmm reg.
- Fill any unaligned trailing bytes.
`movntdq` may appear to be faster because it tells the processor to not bring the data into your cache, but this can cause a performance penalty later if the data is going to be used. It may be more appropriate if you are scrubbing memory before freeing it (like you might do with `SecureZeroMemory()`).
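The steps above can be sketched in C with SSE2 intrinsics. This is only a minimal sketch, not a tuned routine: the name `sse2_zero` and its `streaming` flag are made up for illustration, and the `push`/`pop` steps disappear because the compiler allocates the xmm register itself.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): zero `len` bytes at `dst`.
   A non-zero `streaming` selects non-temporal stores (movntdq). */
static void sse2_zero(void *dst, size_t len, int streaming)
{
    unsigned char *p = (unsigned char *)dst;

    /* Head: byte stores until p is a multiple of 16. */
    while (len > 0 && ((uintptr_t)p & 15) != 0) {
        *p++ = 0;
        --len;
    }

    const __m128i zero = _mm_setzero_si128();  /* pxor xmm, xmm */

    /* Body: one aligned 16-byte store per iteration. */
    if (streaming) {
        for (; len >= 16; p += 16, len -= 16)
            _mm_stream_si128((__m128i *)p, zero);  /* movntdq */
        _mm_sfence();  /* order the non-temporal stores before later stores */
    } else {
        for (; len >= 16; p += 16, len -= 16)
            _mm_store_si128((__m128i *)p, zero);   /* movdqa */
    }

    /* Tail: the remaining 0..15 bytes. */
    while (len > 0) {
        *p++ = 0;
        --len;
    }
}
```

Calling `sse2_zero(buf, n, 0)` keeps the zeroed data in cache, while `sse2_zero(buf, n, 1)` bypasses it, which matches the scrubbing-before-free case described above.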

-
+1 On any reasonable platform, `memset( )` should already be "as fast as possible" (and hence "good enough"). Your pseudocode is odd however; why would you push/pxor/pop inside the loop? – Stephen Canon Oct 08 '12 at 18:19
-
@StephenCanon: I know that at least `memcmp()` is not "as fast as possible" on some versions of OSX. And oops, failed editing! Good spot. – tc. Oct 08 '12 at 18:30
-
`memcmp` is significantly less used (and has significantly more strange tradeoffs in optimizing) than `memset`. `memset` should be one of the very first functions to be optimized on any platform, and is one of the easiest to do well. – Stephen Canon Oct 08 '12 at 19:15
If you want to speed up your code, then you must understand exactly how your CPU works and where the bottleneck is.
Here is my speed-optimized routine, just to show how it should be done.
On my PC it is about 5 times faster (clearing a 1 MB memory block) than yours; test it and ask if something isn't clear:
//edx = memory pointer, must be 16-byte aligned
//ecx = byte count, must be a multiple of 16
  xorps xmm0, xmm0 //Zero xmm0
  mov eax, ecx //Save the full count in eax
  and ecx, 0FFFFFF80h //Round the count down to a multiple of 128 bytes
  jz @ClearRest //Fewer than 128 bytes to clear
@Aligned128BMove:
  movdqa [edx], xmm0 //Clear first 16 bytes of the 128-byte block
  movdqa [edx + 10h], xmm0 //Clear second 16 bytes of the 128-byte block
  movdqa [edx + 20h], xmm0 //...
  movdqa [edx + 30h], xmm0
  movdqa [edx + 40h], xmm0
  movdqa [edx + 50h], xmm0
  movdqa [edx + 60h], xmm0
  movdqa [edx + 70h], xmm0
  add edx, 128 //Advance the memory pointer
  sub ecx, 128 //Decrease the counter
  jnz @Aligned128BMove
@ClearRest:
  and eax, 07Fh //Bytes left over after the 128-byte blocks
  jz @Exit
@LoopRest:
  movdqa [edx], xmm0 //Clear the rest 16 bytes at a time
  add edx, 16
  sub eax, 16
  jnz @LoopRest
@Exit:
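For readers who are not using inline assembly, roughly the same loop structure can be written with SSE2 intrinsics. This is a sketch under the same preconditions as the routine above (16-byte-aligned pointer, count a multiple of 16); the function name `zero_aligned_unrolled` is invented here, and the compiler is free to schedule the stores differently from the hand-written version.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. Preconditions as in the assembly routine:
   p is 16-byte aligned and count is a multiple of 16. */
static void zero_aligned_unrolled(void *p, size_t count)
{
    assert(((uintptr_t)p & 15) == 0);
    assert((count & 15) == 0);

    __m128i *q = (__m128i *)p;
    const __m128i zero = _mm_setzero_si128();  /* like xorps xmm0, xmm0 */

    size_t blocks = count / 128;        /* full 128-byte blocks */
    size_t rest = (count % 128) / 16;   /* leftover 16-byte stores */

    /* Eight aligned stores (movdqa) per iteration, 128 bytes in total. */
    for (; blocks > 0; --blocks, q += 8) {
        _mm_store_si128(q + 0, zero);
        _mm_store_si128(q + 1, zero);
        _mm_store_si128(q + 2, zero);
        _mm_store_si128(q + 3, zero);
        _mm_store_si128(q + 4, zero);
        _mm_store_si128(q + 5, zero);
        _mm_store_si128(q + 6, zero);
        _mm_store_si128(q + 7, zero);
    }

    /* Clear the rest 16 bytes at a time. */
    for (; rest > 0; --rest, ++q)
        _mm_store_si128(q, zero);
}
```

Swapping `_mm_store_si128` for `_mm_stream_si128` (plus a trailing `_mm_sfence()`) gives the `movntdq` variant.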

-
Tried it; it is slower than movntdq. If rewritten with movntdq it is the same speed, and it does not matter whether 16 or 128 bytes are set per iteration – grunge fightr Oct 09 '12 at 14:29
-
Measured again, and movntdq is twice as fast as the WinAPI ZeroMemory, at least in the context of my application (when I called it 100 times as an array clearer it took 160 ms, versus 75 ms for movntdq) – grunge fightr Oct 09 '12 at 15:58
Almost all of the transistors in your CPU are used to somehow make memory access as fast as possible. The CPU is already doing an amazing job at all memory accesses, and instructions execute at a drastically faster rate than memory can be accessed.
Therefore, trying to beat memset is a futile exercise in most cases, because it is already limited by the speed of your memory (as mentioned by others).
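Whether a fill is memory-bound on a particular machine is easy to check by timing it. The harness below is a rough sketch, not a rigorous benchmark: `time_fill`, `fill_memset` and `fill_movntdq` are names invented for this example, `clock()` measures CPU time at coarse resolution, and the streaming variant assumes a 16-byte-aligned buffer whose length is a multiple of 16.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by the fills being compared. */
typedef void (*fill_fn)(void *buf, size_t len);

static void fill_memset(void *buf, size_t len)
{
    memset(buf, 0, len);
}

/* Assumes buf is 16-byte aligned and len is a multiple of 16. */
static void fill_movntdq(void *buf, size_t len)
{
    __m128i *q = (__m128i *)buf;
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < len / 16; ++i)
        _mm_stream_si128(q + i, zero);  /* non-temporal store */
    _mm_sfence();                       /* make the stores globally visible */
}

/* Run `fill` over the buffer `reps` times and return the elapsed
   CPU time in seconds. */
static double time_fill(fill_fn fill, void *buf, size_t len, int reps)
{
    volatile unsigned char *probe = (volatile unsigned char *)buf;
    clock_t start = clock();
    for (int r = 0; r < reps; ++r) {
        probe[0] = 1;    /* dirty the buffer so each call has work to do */
        fill(buf, len);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_fill(fill_memset, buf, 1 << 20, 100)` and `time_fill(fill_movntdq, buf, 1 << 20, 100)` on the same aligned 1 MB buffer mirrors the 100-iteration test described in the comments. If the two numbers come out close, the fill is limited by memory bandwidth rather than by the instructions used.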
