ZeroMemory in SSE

Question

I need a simple ZeroMemory implementation using SSE (SSE2 preferred). Can someone help with that? I searched through SO and the net but did not find a direct answer.

I meant filling a block of memory (about 1 MB) with zeroes, as fast as possible. – grunge fightr, Oct 08 '12 at 18:07
Just look at the source of Agner Fog's optimized subroutine library. Spoiler: nontemporal stores only if the data is big compared to cache size, offsets from the *end* of the block (negative index), and unconditionally do the fixup at the beginning and end. – harold, Oct 08 '12 at 21:18
@GJ. Actually `memset` etc. from gnu libc make heavy use of SSE instructions for exactly that kind of operation. Specifically `movntdq` and similar instructions are intended for exactly this kind of stuff. – Gunther Piez, Oct 08 '12 at 23:05
@harold - I got the same speed with mov and movdqa, so that is probably the limit of my RAM chips and further effort will give no speedup. – grunge fightr, Oct 09 '12 at 06:36
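
The approach harold describes (unconditional fixups at both ends, then a loop indexed by a negative offset from the end) could look roughly like the following. This is only a sketch of that idea, not Agner Fog's actual code: the name `zero_sse2_overlap` is made up, it assumes `len >= 16`, and it leaves out the switch to nontemporal stores for large blocks.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): zero len bytes at dst.
 * Assumption: len >= 16, so the two end stores stay inside the block. */
void zero_sse2_overlap(void *dst, size_t len)
{
    assert(len >= 16);
    unsigned char *p = (unsigned char *)dst;
    const __m128i zero = _mm_setzero_si128();

    /* Unconditional fixups: one unaligned store (movdqu) at each end. */
    _mm_storeu_si128((__m128i *)p, zero);
    _mm_storeu_si128((__m128i *)(p + len - 16), zero);

    /* 16-byte-aligned interior [first, last); both ends are already
     * covered by the two stores above, so overlap is harmless. */
    unsigned char *first =
        (unsigned char *)(((uintptr_t)p + 16) & ~(uintptr_t)15);
    unsigned char *last =
        (unsigned char *)(((uintptr_t)p + len) & ~(uintptr_t)15);

    /* Negative index counting up to zero from the end of the interior. */
    for (ptrdiff_t i = first - last; i < 0; i += 16)
        _mm_store_si128((__m128i *)(last + i), zero);
}
```

Because the end stores are done unconditionally, there are no byte loops and no branches on the alignment of the pointer; some bytes are simply written twice.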

tc. · Answer 1 · 2012-10-08T18:32:28.470

5

Is ZeroMemory() or memset() not good enough?

Disclaimer: Some of the following may be SSE3.

1. Fill any unaligned leading bytes by looping until the address is a multiple of 16.
2. push to save an xmm reg.
3. pxor to zero the xmm reg.
4. While the remaining length >= 16, use movdqa or movntdq to do the write.
5. pop to restore the xmm reg.
6. Fill any unaligned trailing bytes.
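
For anyone who would rather not write the assembly by hand, the steps above can be sketched in C with SSE2 intrinsics. The function name `zero_sse2` is made up for illustration, and the save/restore of the xmm register has no counterpart here because the compiler does the register allocation.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Windows API): zero len bytes starting at dst. */
void zero_sse2(void *dst, size_t len)
{
    assert(dst != NULL || len == 0);
    unsigned char *p = (unsigned char *)dst;

    /* Leading bytes: byte stores until p is 16-byte aligned. */
    while (len > 0 && ((uintptr_t)p & 15) != 0) {
        *p++ = 0;
        len--;
    }

    /* Zero an xmm register (typically compiles to pxor). */
    const __m128i zero = _mm_setzero_si128();

    /* Main loop: aligned 16-byte stores (movdqa). */
    while (len >= 16) {
        _mm_store_si128((__m128i *)p, zero);
        p += 16;
        len -= 16;
    }

    /* Trailing bytes that do not fill a whole 16-byte store. */
    while (len > 0) {
        *p++ = 0;
        len--;
    }
}
```

Something like `zero_sse2(buffer, 1 << 20)` would then clear a 1 MB block, whatever its alignment.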

movntdq may appear to be faster because it tells the processor to not bring the data into your cache, but this can cause a performance penalty later if the data is going to be used. It may be more appropriate if you are scrubbing memory before freeing it (like you might do with SecureZeroMemory()).
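
A minimal sketch of the movntdq variant in C, assuming the pointer is already 16-byte aligned and the length is a multiple of 16. The name `zero_sse2_stream` is made up. The `sfence` at the end is the usual way to make the weakly ordered nontemporal stores visible before any later stores.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: zero len bytes at dst without pulling them into cache.
 * Assumptions: dst is 16-byte aligned and len is a multiple of 16. */
void zero_sse2_stream(void *dst, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 15) == 0);
    __m128i *p = (__m128i *)dst;
    const __m128i zero = _mm_setzero_si128();

    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(p + i, zero);  /* movntdq: nontemporal store */

    _mm_sfence();  /* order the streaming stores before later stores */
}
```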

edited Oct 08 '12 at 18:32

answered Oct 08 '12 at 18:15



+1 On any reasonable platform, `memset( )` should already be "as fast as possible" (and hence "good enough"). Your pseudocode is odd however; why would you push/pxor/pop inside the loop? – Stephen Canon Oct 08 '12 at 18:19
@StephenCanon: I know that at least `memcmp()` is not "as fast as possible" on some versions of OSX. And oops, failediting! Good spot. – tc. Oct 08 '12 at 18:30

`memcmp` is significantly less used (and has significantly more strange tradeoffs in optimizing) than `memset`. `memset` should be one of the very first functions to be optimized on any platform, and is one of the easiest to do well. – Stephen Canon Oct 08 '12 at 19:15

GJ. · Answer 2 · 2012-10-09T13:06:52.513

If you want to speed up your code, then you must understand exactly how your CPU works and where the bottleneck is.

Here is my speed-optimized routine, just to show how it should be done.

On my PC it is about 5 times faster than yours (clearing a 1 MB memory block). Test it and ask if something isn't clear:

//edx = memory pointer must be 16 bytes aligned
//ecx = memory count must be multiple of 16 
    xorps       xmm0, xmm0                      //Clear xmm0
    mov         eax, ecx                        //Save ecx to eax
    and         ecx, 0FFFFFF80h                 //Round count down to a multiple of 128
    jz          @ClearRest                      //Less than 128 bytes to clear
@Aligned128BMove:
    movdqa      [edx], xmm0                     //Clear first 16 bytes of 128 bytes 
    movdqa      [edx + 10h], xmm0               //Clear second 16 bytes of 128 bytes 
    movdqa      [edx + 20h], xmm0               //...
    movdqa      [edx + 30h], xmm0
    movdqa      [edx + 40h], xmm0
    movdqa      [edx + 50h], xmm0
    movdqa      [edx + 60h], xmm0
    movdqa      [edx + 70h], xmm0
    add         edx, 128                        //inc mem pointer
    sub         ecx, 128                        //dec counter
    jnz         @Aligned128BMove
@ClearRest:
    and         eax, 07Fh                       //Bytes left after the 128-byte blocks
    jz          @Exit
@LoopRest:
    movdqa      [edx], xmm0
    add         edx, 16
    sub         eax, 16
    jnz         @LoopRest
@Exit:
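
For comparison, roughly the same routine written in C with SSE2 intrinsics, under the same preconditions as the assembly (16-byte-aligned pointer, count a multiple of 16). The name `zero_sse2_unrolled` is made up, and this is only a sketch: the compiler is free to schedule the stores differently from the hand-written loop.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C counterpart of the assembly routine above.
 * Assumptions: dst is 16-byte aligned and len is a multiple of 16. */
void zero_sse2_unrolled(void *dst, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 15) == 0);
    __m128i *p = (__m128i *)dst;
    const __m128i zero = _mm_setzero_si128();

    /* Whole 128-byte blocks: eight aligned stores per iteration. */
    for (size_t blocks = len / 128; blocks > 0; blocks--) {
        _mm_store_si128(p + 0, zero);
        _mm_store_si128(p + 1, zero);
        _mm_store_si128(p + 2, zero);
        _mm_store_si128(p + 3, zero);
        _mm_store_si128(p + 4, zero);
        _mm_store_si128(p + 5, zero);
        _mm_store_si128(p + 6, zero);
        _mm_store_si128(p + 7, zero);
        p += 8;  /* advance 8 * 16 = 128 bytes */
    }

    /* Remainder: 0 to 7 single 16-byte stores. */
    for (size_t rest = (len % 128) / 16; rest > 0; rest--)
        _mm_store_si128(p++, zero);
}
```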

Tried it; it is slower than movntdq. If rewritten with movntdq it is the same speed, no matter whether it writes 16 or 128 bytes per loop iteration. – grunge fightr, Oct 09 '12 at 14:29
Measured again, and movntdq is twice as fast as WinAPI ZeroMemory, at least in the context of my application (when I ran it 100 times as an array clearer, it took 160 ms versus 75 ms for movntdq). – grunge fightr, Oct 09 '12 at 15:58

score 0 · Answer 3 · answered Oct 12 '12 at 00:49

Almost all of the transistors in your CPU are used to somehow make memory access as fast as possible. The CPU is already doing an amazing job at all memory accesses, and the instructions run at a drastically faster rate than possible memory accesses.

Therefore, trying to beat memset is futile in most cases, because it is already limited by the speed of your memory (as mentioned by others).
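
Whether that holds on a particular machine is easy to check by timing `memset` against a hand-written routine on the same buffer. A rough sketch of such a harness follows; the names `zero_memset` and `time_zeroing` are made up, and `clock()` measures CPU time at coarse resolution, so only large differences over many repetitions mean anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical adapter so memset fits the callback signature below. */
void zero_memset(void *dst, size_t len)
{
    memset(dst, 0, len);
}

/* Hypothetical helper: CPU seconds spent calling fn(buf, len) reps times. */
double time_zeroing(void (*fn)(void *, size_t), void *buf, size_t len, int reps)
{
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(buf, len);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_zeroing(zero_memset, buf, 1 << 20, 100)` and then the same with your own routine in place of `zero_memset` gives two numbers to compare for the 1 MB case discussed here.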
