55

Here's a simple memset bandwidth benchmark:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main()
{
    unsigned long n, r, i;
    unsigned char *p;
    clock_t c0, c1;
    double elapsed;

    n = 1000 * 1000 * 1000; /* GB */
    r = 100; /* repeat */

    p = calloc(n, 1);

    c0 = clock();

    for(i = 0; i < r; ++i) {
        memset(p, (int)i, n);
        printf("%4d/%4ld\r", p[0], r); /* "use" the result */
        fflush(stdout);
    }

    c1 = clock();

    elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;

    printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);

    free(p);
}

On my system (details below) with a single DDR3-1600 memory module, it outputs:

Bandwidth = 4.751 GB/s (Giga = 10^9)

This is 37% of the theoretical RAM speed: 1.6 GHz * 8 bytes = 12.8 GB/s
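(That arithmetic, spelled out as a tiny standalone snippet in case anyone wants to check it; it uses nothing beyond the numbers already quoted above.)

#include <stdio.h>

/* Back-of-the-envelope check of the 37% figure above: one DDR3-1600 channel
   moves 1600e6 transfers/s * 8 bytes/transfer. */
int main()
{
    double peak = 1600e6 * 8.0; /* bytes/s, i.e. 12.8 GB/s */
    double measured = 4.751e9;  /* the memset result above */

    printf("peak = %.1f GB/s, memset = %.0f%% of peak\n",
           peak / 1e9, measured / peak * 100);
}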

On the other hand, here's a similar "read" test:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

unsigned long do_xor(const unsigned long* p, unsigned long n)
{
    unsigned long i, x = 0;

    for(i = 0; i < n; ++i)
        x ^= p[i];
    return x;
}

int main()
{
    unsigned long n, r, i;
    unsigned long *p;
    clock_t c0, c1;
    double elapsed;

    n = 1000 * 1000 * 1000; /* GB */
    r = 100; /* repeat */

    p = calloc(n/sizeof(unsigned long), sizeof(unsigned long));

    c0 = clock();

    for(i = 0; i < r; ++i) {
        p[0] = do_xor(p, n / sizeof(unsigned long)); /* "use" the result */
        printf("%4ld/%4ld\r", i, r);
        fflush(stdout);
    }

    c1 = clock();

    elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;

    printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);

    free(p);
}

It outputs:

Bandwidth = 11.516 GB/s (Giga = 10^9)

I can get close to the theoretical limit for reads (for example, by XORing a large array), but writing appears to be much slower. Why?

OS: Ubuntu 14.04 AMD64 (I compile with gcc -O3. Using -O3 -march=native makes the read performance slightly worse, but does not affect memset.)

CPU: Xeon E5-2630 v2

RAM: A single "16GB PC3-12800 Parity REG CL11 240-Pin DIMM" (what it says on the box). I think that having a single DIMM makes performance more predictable. I'm assuming that with 4 DIMMs, memset will be up to 4 times faster.

Motherboard: Supermicro X9DRG-QF (supports 4-channel memory)

Additional system: a laptop with 2x 4GB of DDR3-1067 RAM: read and write are both about 5.5 GB/s, but note that it uses 2 DIMMs.

P.S. Replacing memset with this version results in exactly the same performance:

void *my_memset(void *s, int c, size_t n)
{
    unsigned long i = 0;
    for(i = 0; i < n; ++i)
        ((char*)s)[i] = (char)c;
    return s;
}
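
P.P.S. In case anyone suspects clock() itself: here is a sketch of the write test timed with wall-clock time (clock_gettime with CLOCK_MONOTONIC) instead. It assumes a POSIX system and may need -lrt on older glibc; the numbers above were all produced with the clock() version.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts); /* wall-clock, not CPU time */
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main()
{
    unsigned long n = 1000UL * 1000 * 1000, r = 100, i;
    unsigned char *p = calloc(n, 1);
    double t0, t1;

    t0 = now_sec();
    for(i = 0; i < r; ++i)
        memset(p, (int)i, n);
    t1 = now_sec();

    printf("%4d\n", p[0]); /* "use" the result, outside the timed region */
    printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n",
           (double)n * r / (t1 - t0) / 1e9);

    free(p);
}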
MWB
  • 11
    `printf("%4d/%4ld\r", p[0], r);` in your benchmark means you're most likely timing that rather than anything else. I/O is slow. – Retired Ninja Sep 13 '14 at 20:35
  • 5
    @RetiredNinja No! `printf` is called 101 times in a program that runs for 20 seconds – MWB Sep 13 '14 at 20:37
  • Is there any paging occurring? – Barmar Sep 13 '14 at 20:39
  • 5
    In the code you posted it should be called 100 times. There's no reason for it to be in the part of the code you are benchmarking. – Retired Ninja Sep 13 '14 at 20:39
  • @RetiredNinja It's good practice to "use" the results of your benchmark calculations, otherwise the compiler may elide the whole computation (in many cases, depending on the specifics). Also, it provides "progress" so you know how long to wait. – MWB Sep 13 '14 at 20:44
  • 2
    I tried it on my system with and without the printf in the loop. The difference was smaller than I expected (run 3 times). With, I got 9.644, 9.667 and 9.629, without I got 9.740, 9.614 and 9.653 – some Sep 13 '14 at 20:44
  • 1
    Probably a matter of cache policy (which is processor specific). – Basile Starynkevitch Sep 13 '14 at 20:53
  • 2
    My 2010 old MacBook reports 1.937 GB/s without optimisation, and 173010.381 GB/s with optimisation with the posted code, unmodified :-) Most likely the memset writes to a cache line which is first read from RAM to cache in order to be modified, and then flushed, so each cache line is read + written instead of just read. The remaining difference will likely be due to reading/writing at non-contiguous locations. PowerPC had instructions to clear cache lines, which would have helped. – gnasher729 Sep 13 '14 at 20:54
  • 1
    @user2864740 11.5 GB/s for XORing. Seriously, printf here is negligible. I never gave it any thought. I'm surprised people are obsessing with it here. – MWB Sep 13 '14 at 20:59
  • In any multiprocessor environment maintaining cache coherence will cause writes to be slower than reads, overall. – Hot Licks Sep 13 '14 at 21:06
  • @gnasher729 what's your compiler? Thwarting the optimizations in a trivial benchmark is an interactive process. You probably need to "use" the results more somehow. – MWB Sep 13 '14 at 21:09
  • @HotLicks This is single-threaded. If you think your comment still applies, perhaps post it as an answer? – MWB Sep 13 '14 at 21:10
  • @user2864740 I added a "read" benchmark to the question. – MWB Sep 13 '14 at 21:11
  • @Barmar ... there's no swapping (verified with free -m). The program allocates 1GB on a system with 16GB of RAM – MWB Sep 13 '14 at 21:17
  • 1
    I can't reproduce your timing difference on my machine. On the contrary, your xor bench is even a bit slower. Did you compile with `-O3 -march=native`? Also, for the same optimization, clang is able to optimize the loop completely out for the `memset` benchmark. – Jens Gustedt Sep 13 '14 at 21:38
  • @JensGustedt Using `-O3 -march=native` makes the read performance slightly worse, but does not affect `memset` for me (edited the question) – MWB Sep 13 '14 at 21:48
  • 1
    BTW, you are not measuring write performance, but performance of your `memset` in your C library (presumably glibc) on your architecture. – Jens Gustedt Sep 13 '14 at 22:02
  • I would be interested in seeing this benchmark when compiled for and run on a PC with FreeDOS-32 as the OS. That way, the overhead of the virtual memory manager and paging can be largely eliminated. – selbie Sep 13 '14 at 22:51
  • 1
    CLOCKS_PER_SEC will most assuredly have the wrong value. Modern processors get a more or less dynamic clock; it can vary wildly. One would have to read the current (!) clock value immediately before using it - but that only works if your program is VERY fast... in fact, one would have to read the value after EVERY PROCESSOR STEP, but that's very hard to implement and would yield only minor precision improvements – specializt Sep 13 '14 at 23:13
  • @specializt CLOCKS_PER_SEC has a misleading name, but its (constant) value is **defined by the C standard**. I seriously doubt that it's "wrong". – MWB Sep 13 '14 at 23:16
  • This is a benchmark comparison between a standard library function (`memset`) and your proprietary function (`do_xor`), not between read operation and write operation. – barak manos Sep 13 '14 at 23:22
  • @barakmanos I added a version with my own implementation of `memset` – MWB Sep 13 '14 at 23:41
  • If the C library version of `memset` is really equal to the one you give, your installation didn't get it right. On modern archs, this should be more sophisticated than that and combine writes to successive memory and things like that. So this boils more and more down to a configuration problem than anything else. Any "answer" to your question would be purely speculative. Perhaps you should close it. – Jens Gustedt Sep 14 '14 at 06:07
  • why is writing a book much slower than reading one? har... writing has to find allocated space and make a record of where that space is for when reading. Reading just looks at that record like a table of contents and proceeds to those locations, which as you know may not be altogether in one chunk. – Gary Hayes Sep 14 '14 at 10:27
  • For raw memory reading speed I found it to be much more accurate to always read 8 bytes in steps of 64 bytes (or whatever your CPU's cache line is). This causes all of the memory to be transferred to L2, with minimal CPU usage. I don't know much about how writing works in detail, but maybe a similar mechanism can be used to reduce all overhead. – PlasmaHH Sep 14 '14 at 10:40
  • This question appears to be off-topic because it already makes a false claim in the question title. It is very architecture-dependent and there can't be a clear-cut answer. – Jens Gustedt Sep 14 '14 at 15:08
  • @Jens There certainly can be a clear cut answer: namely, "your assumption is wrong, it's architecture dependent and here's some of the factors involved, with proof". It's a useful question since this misconception will come up time and again. – Chris Hayes Sep 14 '14 at 18:21

7 Answers

49

With your programs, I get

(write) Bandwidth =  6.076 GB/s
(read)  Bandwidth = 10.916 GB/s

on a desktop (Core i7, x86-64, GCC 4.9, GNU libc 2.19) machine with six 2GB DIMMs. (I don't have any more detail than that to hand, sorry.)

However, this program reports write bandwidth of 12.209 GB/s:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <emmintrin.h>

static void
nt_memset(char *buf, unsigned char val, size_t n)
{
    /* this will only work with aligned address and size */
    assert((uintptr_t)buf % sizeof(__m128i) == 0);
    assert(n % sizeof(__m128i) == 0);

    __m128i xval = _mm_set_epi8(val, val, val, val,
                                val, val, val, val,
                                val, val, val, val,
                                val, val, val, val);

    for (__m128i *p = (__m128i*)buf; p < (__m128i*)(buf + n); p++)
        _mm_stream_si128(p, xval);
    _mm_sfence();
}

/* same main() as your write test, except calling nt_memset instead of memset */

The magic is all in _mm_stream_si128, aka the machine instruction movntdq, which writes a 16-byte quantity to system RAM, bypassing the cache (the official jargon for this is "non-temporal store"). I think this pretty conclusively demonstrates that the performance difference is all about the cache behavior.

N.B. glibc 2.19 does have an elaborately hand-optimized memset that makes use of vector instructions. However, it does not use non-temporal stores. That's probably the Right Thing for memset; in general, you clear memory shortly before using it, so you want it to be hot in the cache. (I suppose an even cleverer memset might switch to non-temporal stores for really huge block clears, on the theory that you could not possibly want all of that in the cache, because the cache simply isn't that big.)

Dump of assembler code for function memset:
=> 0x00007ffff7ab9420 <+0>:     movd   %esi,%xmm8
   0x00007ffff7ab9425 <+5>:     mov    %rdi,%rax
   0x00007ffff7ab9428 <+8>:     punpcklbw %xmm8,%xmm8
   0x00007ffff7ab942d <+13>:    punpcklwd %xmm8,%xmm8
   0x00007ffff7ab9432 <+18>:    pshufd $0x0,%xmm8,%xmm8
   0x00007ffff7ab9438 <+24>:    cmp    $0x40,%rdx
   0x00007ffff7ab943c <+28>:    ja     0x7ffff7ab9470 <memset+80>
   0x00007ffff7ab943e <+30>:    cmp    $0x10,%rdx
   0x00007ffff7ab9442 <+34>:    jbe    0x7ffff7ab94e2 <memset+194>
   0x00007ffff7ab9448 <+40>:    cmp    $0x20,%rdx
   0x00007ffff7ab944c <+44>:    movdqu %xmm8,(%rdi)
   0x00007ffff7ab9451 <+49>:    movdqu %xmm8,-0x10(%rdi,%rdx,1)
   0x00007ffff7ab9458 <+56>:    ja     0x7ffff7ab9460 <memset+64>
   0x00007ffff7ab945a <+58>:    repz retq 
   0x00007ffff7ab945c <+60>:    nopl   0x0(%rax)
   0x00007ffff7ab9460 <+64>:    movdqu %xmm8,0x10(%rdi)
   0x00007ffff7ab9466 <+70>:    movdqu %xmm8,-0x20(%rdi,%rdx,1)
   0x00007ffff7ab946d <+77>:    retq   
   0x00007ffff7ab946e <+78>:    xchg   %ax,%ax
   0x00007ffff7ab9470 <+80>:    lea    0x40(%rdi),%rcx
   0x00007ffff7ab9474 <+84>:    movdqu %xmm8,(%rdi)
   0x00007ffff7ab9479 <+89>:    and    $0xffffffffffffffc0,%rcx
   0x00007ffff7ab947d <+93>:    movdqu %xmm8,-0x10(%rdi,%rdx,1)
   0x00007ffff7ab9484 <+100>:   movdqu %xmm8,0x10(%rdi)
   0x00007ffff7ab948a <+106>:   movdqu %xmm8,-0x20(%rdi,%rdx,1)
   0x00007ffff7ab9491 <+113>:   movdqu %xmm8,0x20(%rdi)
   0x00007ffff7ab9497 <+119>:   movdqu %xmm8,-0x30(%rdi,%rdx,1)
   0x00007ffff7ab949e <+126>:   movdqu %xmm8,0x30(%rdi)
   0x00007ffff7ab94a4 <+132>:   movdqu %xmm8,-0x40(%rdi,%rdx,1)
   0x00007ffff7ab94ab <+139>:   add    %rdi,%rdx
   0x00007ffff7ab94ae <+142>:   and    $0xffffffffffffffc0,%rdx
   0x00007ffff7ab94b2 <+146>:   cmp    %rdx,%rcx
   0x00007ffff7ab94b5 <+149>:   je     0x7ffff7ab945a <memset+58>
   0x00007ffff7ab94b7 <+151>:   nopw   0x0(%rax,%rax,1)
   0x00007ffff7ab94c0 <+160>:   movdqa %xmm8,(%rcx)
   0x00007ffff7ab94c5 <+165>:   movdqa %xmm8,0x10(%rcx)
   0x00007ffff7ab94cb <+171>:   movdqa %xmm8,0x20(%rcx)
   0x00007ffff7ab94d1 <+177>:   movdqa %xmm8,0x30(%rcx)
   0x00007ffff7ab94d7 <+183>:   add    $0x40,%rcx
   0x00007ffff7ab94db <+187>:   cmp    %rcx,%rdx
   0x00007ffff7ab94de <+190>:   jne    0x7ffff7ab94c0 <memset+160>
   0x00007ffff7ab94e0 <+192>:   repz retq 
   0x00007ffff7ab94e2 <+194>:   movq   %xmm8,%rcx
   0x00007ffff7ab94e7 <+199>:   test   $0x18,%dl
   0x00007ffff7ab94ea <+202>:   jne    0x7ffff7ab950e <memset+238>
   0x00007ffff7ab94ec <+204>:   test   $0x4,%dl
   0x00007ffff7ab94ef <+207>:   jne    0x7ffff7ab9507 <memset+231>
   0x00007ffff7ab94f1 <+209>:   test   $0x1,%dl
   0x00007ffff7ab94f4 <+212>:   je     0x7ffff7ab94f8 <memset+216>
   0x00007ffff7ab94f6 <+214>:   mov    %cl,(%rdi)
   0x00007ffff7ab94f8 <+216>:   test   $0x2,%dl
   0x00007ffff7ab94fb <+219>:   je     0x7ffff7ab945a <memset+58>
   0x00007ffff7ab9501 <+225>:   mov    %cx,-0x2(%rax,%rdx,1)
   0x00007ffff7ab9506 <+230>:   retq   
   0x00007ffff7ab9507 <+231>:   mov    %ecx,(%rdi)
   0x00007ffff7ab9509 <+233>:   mov    %ecx,-0x4(%rdi,%rdx,1)
   0x00007ffff7ab950d <+237>:   retq   
   0x00007ffff7ab950e <+238>:   mov    %rcx,(%rdi)
   0x00007ffff7ab9511 <+241>:   mov    %rcx,-0x8(%rdi,%rdx,1)
   0x00007ffff7ab9516 <+246>:   retq   

(This is in libc.so.6, not the program itself -- the other person who tried to dump the assembly for memset seems only to have found its PLT entry. The easiest way to get the assembly dump for the real memset on a Unixy system is

$ gdb ./a.out
(gdb) set env LD_BIND_NOW t
(gdb) b main
Breakpoint 1 at [address]
(gdb) r
Breakpoint 1, [address] in main ()
(gdb) disas memset
...

.)
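
As an aside, here is a rough sketch of the "switch to non-temporal stores for huge blocks" idea from a few paragraphs up. This is not what glibc actually does; the 8 MB cutoff is an arbitrary stand-in for "much larger than the last-level cache", and the function is assumed to live in the same file as nt_memset above.

#include <stdint.h>
#include <string.h>

#define NT_THRESHOLD (8u * 1024 * 1024) /* placeholder, not a tuned value */

static void *memset_maybe_nt(void *s, int c, size_t n)
{
    /* small, misaligned or odd-sized blocks: ordinary cached memset */
    if (n < NT_THRESHOLD
        || (uintptr_t)s % 16 != 0   /* nt_memset needs 16-byte alignment */
        || n % 16 != 0)
        return memset(s, c, n);

    /* huge aligned blocks: bypass the cache entirely */
    nt_memset((char *)s, (unsigned char)c, n);
    return s;
}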

zwol
  • Ah, I figured I must have been wrong about `memset`, thanks for posting the correct disassembly. And it's great to know that trick in gdb! – Patrick Collins Sep 17 '14 at 09:01
  • The main reason `movnt` stores can give better write bandwidth for large memsets is that they're weakly-ordered. They can skip the read-for-ownership step when writing to a fresh cache-line, because they're not guaranteed to be globally visible in order with each other or with respect to normal stores. On CPUs with "fast string operations" (Intel IvB and later), `rep stos` uses somewhat-weakly-ordered stores to get the same speedup, but doesn't bypass the cache. As I understand the docs, there's a store fence at the end of the operation, so just don't store the flag as part of memset/cpy. – Peter Cordes Sep 29 '15 at 11:39
  • @PeterCordes If I understand your comment: is the CPU core reading a cache line even when it is going to be completely overwritten? Is there any way to force this "weak" behaviour for other write instructions? (I mean, is there a way to write to memory without reading it first, while keeping the data in cache?) – Will Jan 14 '20 at 15:51
  • @Will: For other stores to work that way, you have to be writing to a region of memory that's WC (uncacheable write-combining) instead of normal WB, set using the MTRRs or PAT. You normally can't easily allocate memory that way from user-space under most OSes, and it makes efficient read difficult. See also [Enhanced REP MOVSB for memcpy](//stackoverflow.com/q/43343231) for more about NT stores vs. regular. Yes, normal strongly-ordered stores always do an RFO (read for ownership) before committing data to L1d cache in Modified state, vs. just invaliding other caches and going to DRAM. – Peter Cordes Jan 15 '20 at 00:58
30

The main difference in performance comes from the caching policy of your PC/memory region. When you read from memory and the data is not in the cache, it must first be fetched into the cache over the memory bus before you can perform any computation with it.

When you write to memory, however, there are different write policies. Most likely your system uses a write-back cache (or, more precisely, "write allocate"): when you write to a memory location that is not in the cache, the data is first fetched from memory into the cache and eventually written back to memory when it is evicted from the cache. That means a round-trip for the data and 2x the bus bandwidth used on writes. There is also a write-through caching policy (or "no-write allocate"), which generally means that on a write miss the data is not fetched into the cache; that should give closer to the same performance for reads and writes.
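
If you want to see the write-allocate penalty directly, one option is to sweep the buffer size: while the buffer fits in the caches, the timed memset loop barely touches DRAM, and you would expect the bandwidth to drop sharply once every store starts missing and allocating lines. A sketch (sizes and repeat counts are arbitrary, and it assumes a 64-bit build with enough free RAM):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main()
{
    /* roughly L1-, L2-, L3-sized buffers, then clearly DRAM-sized ones */
    size_t sizes[] = { 32UL * 1024, 256UL * 1024, 8UL * 1024 * 1024,
                       64UL * 1024 * 1024, 1024UL * 1024 * 1024 };
    size_t k, i;

    for (k = 0; k < sizeof sizes / sizeof sizes[0]; ++k) {
        size_t n = sizes[k];
        size_t r = (1UL << 34) / n;   /* ~16 GB of stores per size */
        unsigned char *p = malloc(n);
        clock_t c0, c1;

        memset(p, 0, n);              /* fault the pages in, untimed */
        c0 = clock();
        for (i = 0; i < r; ++i)
            memset(p, (int)i, n);
        c1 = clock();

        printf("%10zu bytes: %6.3f GB/s (p[0]=%d)\n", n,
               (double)n * r / ((c1 - c0) / (double)CLOCKS_PER_SEC) / 1e9,
               p[0]);
        free(p);
    }
}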

JarkkoL
  • Thanks for confirming my earlier guess (I posted it ~30 min earlier)! I'm going to accept it, until/unless someone convinces me that it's factually inaccurate. – MWB Sep 14 '14 at 16:01
  • On some platforms you can actually control the caching policy per allocation, and write performance is one of the reasons. – JarkkoL Sep 14 '14 at 23:54
  • Conventional architectures will write back all dirty data to memory at some point in time. Nowadays, many platforms try to improve performance by means of additional cache-control features. For example, platforms like Cavium Octeon provide special cache-control policies like the DWB (Don't Write Back) option to not write back L2 cache data. Due to this, unnecessary write-backs of L2 data to memory can be avoided. – Karthik Balaguru Oct 02 '14 at 10:25
16

The difference -- at least on my machine, with an AMD processor -- is that the read program is using vectorized operations. Disassembling the two yields this for the write program:

0000000000400610 <main>:
  ...
  400628:       e8 73 ff ff ff          callq  4005a0 <clock@plt>
  40062d:       49 89 c4                mov    %rax,%r12
  400630:       89 de                   mov    %ebx,%esi
  400632:       ba 00 ca 9a 3b          mov    $0x3b9aca00,%edx
  400637:       48 89 ef                mov    %rbp,%rdi
  40063a:       e8 71 ff ff ff          callq  4005b0 <memset@plt>
  40063f:       0f b6 55 00             movzbl 0x0(%rbp),%edx
  400643:       b9 64 00 00 00          mov    $0x64,%ecx
  400648:       be 34 08 40 00          mov    $0x400834,%esi
  40064d:       bf 01 00 00 00          mov    $0x1,%edi
  400652:       31 c0                   xor    %eax,%eax
  400654:       48 83 c3 01             add    $0x1,%rbx
  400658:       e8 a3 ff ff ff          callq  400600 <__printf_chk@plt>

But this for the reading program:

00000000004005d0 <main>:
  ....
  400609:       e8 62 ff ff ff          callq  400570 <clock@plt>
  40060e:       49 d1 ee                shr    %r14
  400611:       48 89 44 24 18          mov    %rax,0x18(%rsp)
  400616:       4b 8d 04 e7             lea    (%r15,%r12,8),%rax
  40061a:       4b 8d 1c 36             lea    (%r14,%r14,1),%rbx
  40061e:       48 89 44 24 10          mov    %rax,0x10(%rsp)
  400623:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400628:       4d 85 e4                test   %r12,%r12
  40062b:       0f 84 df 00 00 00       je     400710 <main+0x140>
  400631:       49 8b 17                mov    (%r15),%rdx
  400634:       bf 01 00 00 00          mov    $0x1,%edi
  400639:       48 8b 74 24 10          mov    0x10(%rsp),%rsi
  40063e:       66 0f ef c0             pxor   %xmm0,%xmm0
  400642:       31 c9                   xor    %ecx,%ecx
  400644:       0f 1f 40 00             nopl   0x0(%rax)
  400648:       48 83 c1 01             add    $0x1,%rcx
  40064c:       66 0f ef 06             pxor   (%rsi),%xmm0
  400650:       48 83 c6 10             add    $0x10,%rsi
  400654:       49 39 ce                cmp    %rcx,%r14
  400657:       77 ef                   ja     400648 <main+0x78>
  400659:       66 0f 6f d0             movdqa %xmm0,%xmm2 ;!!!! vectorized magic
  40065d:       48 01 df                add    %rbx,%rdi
  400660:       66 0f 73 da 08          psrldq $0x8,%xmm2
  400665:       66 0f ef c2             pxor   %xmm2,%xmm0
  400669:       66 0f 7f 04 24          movdqa %xmm0,(%rsp)
  40066e:       48 8b 04 24             mov    (%rsp),%rax
  400672:       48 31 d0                xor    %rdx,%rax
  400675:       48 39 dd                cmp    %rbx,%rbp
  400678:       74 04                   je     40067e <main+0xae>
  40067a:       49 33 04 ff             xor    (%r15,%rdi,8),%rax
  40067e:       4c 89 ea                mov    %r13,%rdx
  400681:       49 89 07                mov    %rax,(%r15)
  400684:       b9 64 00 00 00          mov    $0x64,%ecx
  400689:       be 04 0a 40 00          mov    $0x400a04,%esi
  400695:       e8 26 ff ff ff          callq  4005c0 <__printf_chk@plt>
  40068e:       bf 01 00 00 00          mov    $0x1,%edi
  400693:       31 c0                   xor    %eax,%eax

Also, note that your "homegrown" memset is actually optimized down to a call to memset:

00000000004007b0 <my_memset>:
  4007b0:       48 85 d2                test   %rdx,%rdx
  4007b3:       74 1b                   je     4007d0 <my_memset+0x20>
  4007b5:       48 83 ec 08             sub    $0x8,%rsp
  4007b9:       40 0f be f6             movsbl %sil,%esi
  4007bd:       e8 ee fd ff ff          callq  4005b0 <memset@plt>
  4007c2:       48 83 c4 08             add    $0x8,%rsp
  4007c6:       c3                      retq   
  4007c7:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  4007ce:       00 00 
  4007d0:       48 89 f8                mov    %rdi,%rax
  4007d3:       c3                      retq   
  4007d4:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4007db:       00 00 00 
  4007de:       66 90                   xchg   %ax,%ax

I can't find any references regarding whether or not memset uses vectorized operations; the disassembly of memset@plt is unhelpful here:

00000000004005b0 <memset@plt>:
  4005b0:       ff 25 72 0a 20 00       jmpq   *0x200a72(%rip)        # 601028 <_GLOBAL_OFFSET_TABLE_+0x28>
  4005b6:       68 02 00 00 00          pushq  $0x2
  4005bb:       e9 c0 ff ff ff          jmpq   400580 <_init+0x20>

This question suggests that since memset is designed to handle every case, it might be missing some optimizations.

This guy definitely seems convinced that you need to roll your own assembler memset to take advantage of SIMD instructions. This question does, too.

I'm going to take a shot in the dark and guess that it's not using SIMD operations because it can't tell whether or not it's going to be operating on something that's a multiple of the size of one vectorized operation, or there's some alignment-related issue.

However, we can confirm that it's not an issue of cache efficiency by checking with cachegrind. The write program produces the following:

==19593== D   refs:       6,312,618,768  (80,386 rd   + 6,312,538,382 wr)
==19593== D1  misses:     1,578,132,439  ( 5,350 rd   + 1,578,127,089 wr)
==19593== LLd misses:     1,578,131,849  ( 4,806 rd   + 1,578,127,043 wr)
==19593== D1  miss rate:           24.9% (   6.6%     +          24.9%  )
==19593== LLd miss rate:           24.9% (   5.9%     +          24.9%  )
==19593== 
==19593== LL refs:        1,578,133,467  ( 6,378 rd   + 1,578,127,089 wr)
==19593== LL misses:      1,578,132,871  ( 5,828 rd   + 1,578,127,043 wr) << 
==19593== LL miss rate:             9.0% (   0.0%     +          24.9%  )

and the read program produces:

==19682== D   refs:       6,312,618,618  (6,250,080,336 rd   + 62,538,282 wr)
==19682== D1  misses:     1,578,132,331  (1,562,505,046 rd   + 15,627,285 wr)
==19682== LLd misses:     1,578,131,740  (1,562,504,500 rd   + 15,627,240 wr)
==19682== D1  miss rate:           24.9% (         24.9%     +       24.9%  )
==19682== LLd miss rate:           24.9% (         24.9%     +       24.9%  )
==19682== 
==19682== LL refs:        1,578,133,357  (1,562,506,072 rd   + 15,627,285 wr)
==19682== LL misses:      1,578,132,760  (1,562,505,520 rd   + 15,627,240 wr) <<
==19682== LL miss rate:             4.1% (          4.1%     +       24.9%  )

While the read program has a lower LL miss rate because it performs many more reads (an extra read per XOR operation), the total number of misses is the same. So whatever the issue is, it's not there.
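
(The blocks above are cachegrind's exit summaries; to reproduce them, an invocation along these lines should work, with the binary names being placeholders. Cachegrind slows execution down a lot, so you may want to reduce the repeat count r first.)

valgrind --tool=cachegrind ./write_test
valgrind --tool=cachegrind ./read_test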

Patrick Collins
  • Are you also seeing 2-fold difference in bandwidth? Can you post your numbers and RAM configuration? – MWB Sep 14 '14 at 05:13
  • 2
    `This guy definitely seems convinced ...` His buffer is 244000 times smaller and fits in various caches. – MWB Sep 14 '14 at 05:21
  • Your memset is almost certainly vectorized to some extent; some of the smarter implementations will run a small loop up to alignment before they launch into the vectorized version. I'm guessing you're on Linux, probably using glibc, so here's [its memset](https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S;hb=HEAD). (With a bit of fiddling with the GOT, or a couple of `stepi`s in GDB, you should be able to find the implementation yourself.) – saagarjha Apr 12 '20 at 03:23
9

Caching and locality almost certainly explain most of the effects you are seeing.

There isn't any caching or locality on writes, unless you want a non-deterministic system. Most write times are measured as the time it takes for the data to get all the way to the storage medium (whether this is a hard drive or a memory chip), whereas reads can come from any number of cache layers that are faster than the storage medium.

Robert Harvey
  • 1 GB array is much bigger than any cache size (that's why I chose it). By the time `do_xor` runs the second time, any previously cached values will have been expunged. Besides, caching could explain reading being faster than the DRAM->Cache link (if this were the case). It does not explain writing being slower. – MWB Sep 13 '14 at 23:56
  • 5
    I hope that it is self-evident that you don't need a 1GB cache to still see cache effects. – Robert Harvey Sep 14 '14 at 00:28
  • 1
    +1 -- I'm willing to bet that prefetching has something to do with it; it's not going to help those writes, but it will help the reads. I'm also willing to bet that GCC is less willing to reorder the writes than the reads. – Patrick Collins Sep 14 '14 at 01:33
  • On x86, normal stores (not `movnt`) are strongly-ordered. Writing to a cold cache-line triggers a read-for-ownership. As I understand it, the CPU really does do a read from DRAM (or lower level cache) to fill the cache line. Writes are harder than reads for a system with strongly ordered memory (like x86), but not for the reason you give. Stores are allowed to be buffered and become globally visible after loads done by the same thread. (MFENCE is a StoreLoad barrier...) AMD does use write-through caches for simplicity, but Intel uses write-back for better performance. – Peter Cordes Sep 29 '15 at 11:54
  • It's definitely true in practice that repeating a write-only loop (like memset) with a buffer that fits in L1 is faster than with a larger buffer. Part of that is that lines that are already in the M state (of MESI) don't require any other lines to be evicted (which could stall if the evicted line was in the M state and had to be written L2 first, esp. if L2 then evicted a modified line, etc. down to DRAM). But another part of that is avoiding the read-for-ownership when a cacheline is already in the E or M state. `movnt` and Fast String rep movsb weakly-ordered stores avoid the RFO. – Peter Cordes Sep 29 '15 at 11:59
6

It might be Just How it (the-System-as-a-Whole) Performs. Reads being faster appears to be a common trend, with a wide range of relative throughput performance. A quick analysis of the DDR3 Intel and DDR2 charts listed, taking a few select cases of (write/read)%:

Some top-performing DDR3 chips write at about ~60-70% of the read throughput. However, there are some memory modules (e.g. Golden Empire CL11-13-13 D3-2666) down to only ~30% write.

Top-performing DDR2 chips appear to have only about 50% of the write throughput compared to the read. But there are also some notably bad contenders (e.g. OCZ OCZ21066NEW_BT1G) down to ~20%.

While this may not explain the cause of the ~40% write/read ratio reported, since the benchmark code and setup used are likely different (the notes are vague), this is definitely a factor. (I would run some existing benchmark programs and see if the numbers fall in line with those of the code posted in the question.)


Update:

I downloaded the memory look-up table from the linked site and processed it in Excel. While it still shows a wide range of values, it is much less severe than the original reply above, which only looked at the top-read memory chips and a few selected "interesting" entries from the charts. I'm not sure why the discrepancies, especially in the terrible contenders singled out above, are not present in the secondary list.

However, even under the new numbers the difference still ranges widely, from 50%-100% (median 65, mean 65) of the read performance. Do note that just because a chip was "100%" efficient in the write/read ratio doesn't mean it was better overall... just that it was more even-keeled between the two operations.

user2864740
  • It's unclear if they have 1 DIMM or multiple DIMMs installed. I believe that can make a very significant difference. My test is "pure" in the sense that I only have 1 DIMM. – MWB Sep 13 '14 at 22:14
  • @MaxB It isn't very clear at all, but it does show a wide range of values. That's why my recommendation would be to see if other benchmarks programs result in similar values on the particular machine; and if so, if the posted benchmark also follows suit on different hardware. – user2864740 Sep 13 '14 at 23:05
4

Here's my working hypothesis. If correct, it explains why writes are about twice as slow as reads:

Even though memset only writes to virtual memory, ignoring its previous contents, at the hardware level, the computer cannot do a pure write to DRAM: it reads the contents of DRAM into cache, modifies them there and then writes them back to DRAM. Therefore, at the hardware level, memset does both reading and writing (even though the former seems useless)! Hence the roughly two-fold speed difference.
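
A quick consistency check using only the numbers from the question (this doesn't prove the hypothesis; it just shows the measured write figure is compatible with a halved ceiling):

#include <stdio.h>

int main()
{
    double peak = 12.8;      /* GB/s, single DDR3-1600 channel */
    double read_bw = 11.516; /* measured above */
    double write_bw = 4.751; /* measured above */

    printf("read : %.0f%% of the full ceiling (%.1f GB/s)\n",
           read_bw / peak * 100, peak);
    printf("write: %.0f%% of the halved ceiling (%.1f GB/s)\n",
           write_bw / (peak / 2) * 100, peak / 2);
}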

MWB
  • 1
    You can avoid this read-for-ownership with weakly-ordered stores (`movnt` or Intel IvB-and-later `rep stos` / `rep movs` "Fast String Operations"). It sucks that there isn't a convenient way to do weakly-ordered stores (other than memset/memcpy on recent Intel CPUs) without also bypassing the cache. I left similar comments on some other answers: the main reason for normal writes triggering reads is x86's strongly-ordered memory model. Limiting your system to one DIMM or not shouldn't be a factor in this. – Peter Cordes Sep 29 '15 at 12:07
  • I expect some other architectures, like ARM, do write at full DRAM bandwidth without any extra effort, because there's no guarantee that stores will be visible to other threads in program-order. e.g. a store to a hot cache line could happen right away (or at least, after making sure no previous instruction can fault or be a mispredicted branch), but a store to cold cache line might just get buffered without any way for other cores to see the value until the cold cache-line is fully rewritten and the store-buffer is flushed. – Peter Cordes Sep 29 '15 at 12:08
2

Because to read you simply pulse the address lines and read out the core states on the sense lines. The write-back cycle occurs after the data is delivered to the CPU and hence doesn't slow things down. On the other hand, to write you must first perform a fake read to reset the cores, then perform the write cycle.

(Just in case it's not obvious, this answer is tongue-in-cheek -- describing why write is slower than read on an old core memory box.)

Hot Licks