
I was inspired by this question to write a simple program to test my machine's memory bandwidth in each cache level:

Why vectorizing the loop does not have performance improvement

My code uses memset to write to a buffer (or buffers) over and over and measures the speed. It also saves the address of every buffer to print at the end. Here's the listing:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE_KB {8, 16, 24, 28, 32, 36, 40, 48, 64, 128, 256, 384, 512, 768, 1024, 1025, 2048, 4096, 8192, 16384, 200000}
#define TESTMEM 10000000000 // Approximate, in bytes
#define BUFFERS 1

double timer(void)
{
    struct timeval ts;
    double ans;

    gettimeofday(&ts, NULL);
    ans = ts.tv_sec + ts.tv_usec*1.0e-6;

    return ans;
}

int main(int argc, char **argv)
{
    double *x[BUFFERS];
    double t1, t2;
    int kbsizes[] = SIZE_KB;
    double bandwidth[sizeof(kbsizes)/sizeof(int)];
    int iterations[sizeof(kbsizes)/sizeof(int)];
    double *address[sizeof(kbsizes)/sizeof(int)][BUFFERS];
    int i, j, k;

    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
        iterations[k] = TESTMEM/(kbsizes[k]*1024);

    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
    {
        // Allocate
        for (j = 0; j < BUFFERS; j++)
        {
            x[j] = (double *) malloc(kbsizes[k]*1024);
            address[k][j] = x[j];
            memset(x[j], 0, kbsizes[k]*1024);
        }

        // Measure
        t1 = timer();
        for (i = 0; i < iterations[k]; i++)
        {
            for (j = 0; j < BUFFERS; j++)
                memset(x[j], 0xff, kbsizes[k]*1024);
        }
        t2 = timer();
        bandwidth[k] = (BUFFERS*kbsizes[k]*iterations[k])/1024.0/1024.0/(t2-t1);  // total KB written -> GB, per second

        // Free
        for (j = 0; j < BUFFERS; j++)
            free(x[j]);
    }

    printf("TESTMEM = %ld\n", TESTMEM);
    printf("BUFFERS = %d\n", BUFFERS);
    printf("Size (kB)\tBandwidth (GB/s)\tIterations\tAddresses\n");
    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
    {
        printf("%7d\t\t%.2f\t\t\t%d\t\t%x", kbsizes[k], bandwidth[k], iterations[k], address[k][0]);
        for (j = 1; j < BUFFERS; j++)
            printf(", %x", address[k][j]);
        printf("\n");
    }

    return 0;
}

And the results (with BUFFERS = 1):

TESTMEM = 10000000000
BUFFERS = 1
Size (kB)   Bandwidth (GB/s)    Iterations  Addresses
      8     52.79               1220703     90b010
     16     56.48               610351      90b010
     24     57.01               406901      90b010
     28     57.13               348772      90b010
     32     45.40               305175      90b010
     36     38.11               271267      90b010
     40     38.02               244140      90b010
     48     38.12               203450      90b010
     64     37.51               152587      90b010
    128     36.89               76293       90b010
    256     35.58               38146       d760f010
    384     31.01               25431       d75ef010
    512     26.79               19073       d75cf010
    768     26.20               12715       d758f010
   1024     26.20               9536        d754f010
   1025     18.30               9527        90b010
   2048     18.29               4768        d744f010
   4096     18.29               2384        d724f010
   8192     18.31               1192        d6e4f010
  16384     18.31               596         d664f010
 200000     18.32               48          cb2ff010

I can easily see the effect of the 32K L1 cache and the 256K L2 cache. What I don't understand is why performance drops so suddenly once the size of the memset buffer exceeds 1M. My L3 cache is supposed to be 8M. The drop is also abrupt, not tapered at all like the transitions when the L1 and L2 cache sizes were exceeded.

My processor is the Intel i7 3700. The details of the L3 cache from /sys/devices/system/cpu/cpu0/cache are:

level = 3
coherency_line_size = 64
number_of_sets = 8192
physical_line_partition = 1
shared_cpu_list = 0-7
shared_cpu_map = ff
size = 8192K
type = Unified
ways_of_associativity = 16

I thought I would try using multiple buffers - call memset on 2 buffers of 1M each and see if performance would drop. With BUFFERS = 2, I get:

TESTMEM = 10000000000
BUFFERS = 2
Size (kB)   Bandwidth (GB/s)    Iterations  Addresses
      8     54.15               1220703     e59010, e5b020
     16     51.52               610351      e59010, e5d020
     24     38.94               406901      e59010, e5f020
     28     38.53               348772      e59010, e60020
     32     38.31               305175      e59010, e61020
     36     38.29               271267      e59010, e62020
     40     38.29               244140      e59010, e63020
     48     37.46               203450      e59010, e65020
     64     36.93               152587      e59010, e69020
    128     35.67               76293       e59010, 63769010
    256     27.21               38146       63724010, 636e3010
    384     26.26               25431       63704010, 636a3010
    512     26.19               19073       636e4010, 63663010
    768     26.20               12715       636a4010, 635e3010
   1024     26.16               9536        63664010, 63563010
   1025     18.29               9527        e59010, f59420
   2048     18.23               4768        63564010, 63363010
   4096     18.27               2384        63364010, 62f63010
   8192     18.29               1192        62f64010, 62763010
  16384     18.31               596         62764010, 61763010
 200000     18.31               48          57414010, 4b0c3010

It appears that both 1M buffers stay in the L3 cache. But try to increase the size of either buffer ever so slightly and the performance drops.

I've been compiling with -O3. It doesn't make much difference (except possibly unrolling the loops over BUFFERS). I tried with -O0 and it's the same except for the L1 speeds. gcc version is 4.9.1.

To summarize, I have a 2-part question:

  1. Why does my 8 MB L3 cache not provide any benefit on blocks of memory larger than 1M?
  2. Why is the drop in performance so sudden?

EDIT:

As suggested by Gabriel Southern, I ran my code with perf using BUFFERS=1 with only one buffer size at a time. This was the full command:

perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses -r 100 ./a.out 2> perfout.txt

The `-r 100` means that perf will run a.out 100 times and report the averaged statistics.

The output of perf, with #define SIZE_KB {1024}:

 Performance counter stats for './a.out' (100 runs):

         1,508,798 dTLB-loads                                                    ( +-  0.02% )
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits 
       625,967,550 dTLB-stores                                                   ( +-  0.00% )
             1,503 dTLB-store-misses                                             ( +-  0.79% )

       0.360471583 seconds time elapsed                                          ( +-  0.79% )

and with #define SIZE_KB {1025}:

 Performance counter stats for './a.out' (100 runs):

         1,670,402 dTLB-loads                                                    ( +-  0.09% )
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits 
       626,099,850 dTLB-stores                                                   ( +-  0.00% )
             2,115 dTLB-store-misses                                             ( +-  2.19% )

       0.503913416 seconds time elapsed                                          ( +-  0.06% )

So there do seem to be more TLB misses with the 1025K buffer. However, at this buffer size the program makes about 9500 calls to memset, so that is still less than 1 miss per memset call.

hewy
  • Is it not the other way around, the L2 cache of 1MB helps up to the point where you go over 1MB? – Mats Petersson May 18 '15 at 22:15
  • My L2 cache is 256K per core. Each core can only access its 256K. – hewy May 18 '15 at 22:21
  • Can you please just calculate `kbsizes[k]*1024` once and stick it in a `const` variable? Thus removing any chance that one of the usages got the calculation slightly wrong. – Ben Voigt May 18 '15 at 22:34
  • Is this an aliasing thing? Perhaps the mapping of address to cache line is such that each MB of a contiguous buffer aliases to the same MB in cache, whereas in your 2-buffer scenario, perhaps the high-order bits map it to elsewhere. (I have no idea what mapping function is used in your particular processor...) – Oliver Charlesworth May 18 '15 at 22:45
  • L3 is a shared cache in Intel's x86 processors. Some of the 8 MB may be in use by another core. – EOF May 18 '15 at 22:50
  • @BenVoigt I'm not sure what you mean. I don't think it can be `const` because it would have to change with each k. Did you think I should make another array with the sizes in bytes? Looking at the code, I count 4 places where I could make such a mistake: malloc, 2 memsets, and the calculation of bandwidth. It looks right to me... Also it did work correctly for L1 and L2... – hewy May 18 '15 at 22:56
  • @EOF Nothing else is running on the machine. I am logged in remotely and that's it. Just some system processes. – hewy May 18 '15 at 22:57
  • @OliverCharlesworth I wondered about that. But the L3 cache should be 16-way associative, meaning the critical stride is 0.5M. So to fit a 1M array in there, it had to use 2 ways. The second 0.5M would be getting mapped to the same places as the first 0.5M. – hewy May 18 '15 at 23:02
  • Your cache is 8M but 16-way associative, which means you'll start aliasing after the first 500k has been filled. – Ron Kuper May 19 '15 at 00:22
  • @RonKuper since the cache is 16-way set associative, aliasing doesn't cause a problem unless 17 or more addresses alias to the same location which does not happen in this code. – Gabriel Southern May 19 '15 at 01:21
  • When you address the L3 cache, you're also addressing the L1 and L2 caches. Perhaps the slowdown you see is from thrashing the L1 cache. – Mike Crawford May 19 '15 at 03:58
  • @hewy: you're right. Unless, of course, the mapping is such that each 64kB chunk is being mapped to the same lines (in which case we exhaust the ways after 1MB). Unlikely though... – Oliver Charlesworth May 19 '15 at 07:24
  • @MichaelCrawford: but he already sees that slowdown at the 32-36kB step. – Oliver Charlesworth May 19 '15 at 07:28
  • @OliverCharlesworth I'm with you, it's unlikely. I had noticed in the Intel optimization guide that the L3 cache is divided into N "slices" with N=number of processors. I have a quad core with hyperthreading, so it appears to the system as N=8. In the cache it still might be N=4 though. If it is N=8, each slice is 1M, and then maybe, just maybe, what you said might happen. It just seems, well, stupid of Intel. I don't know how to see the size of each slice. Also I would expect a gradual slowdown at least over the next 1024/16=64 kB, much like the other transitions, if this were the case. – hewy May 19 '15 at 15:24
  • @OliverCharlesworth If I don't exceed the cache size by more than a factor of 1/(ways of associativity), then even if the processor is using a LRU eviction policy, some of the data from the beginning of the array will stay in cache because the first way in cache will not get completely evicted. Sorry ran out of space in the previous comment. – hewy May 19 '15 at 15:25
  • Do you have perf installed? If so can you try running `$perf2 stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses` with the 1024 and the 1025 test cases and seeing if there is a significant difference in TLB misses? I can't reproduce the behavior you described with my system, but I think the L2 TLB for your CPU has 512 entries and the default page size is 4KB. So this might be something that explains the behavior you are seeing. If my theory is correct and you do notice a difference I'll post an answer with what I think is happening. – Gabriel Southern May 19 '15 at 18:51
  • @GabrielSouthern I don't know about `perf2`, but I have `perf`. Take a look at my edit? – hewy May 19 '15 at 23:01
  • Can you show the assembly code for the memset area? – Leeor May 19 '15 at 23:06
  • @hewy I meant to write `perf` – Gabriel Southern May 19 '15 at 23:13
  • the results from perf don't seem to show any thrashing in the TLB. Can I ask what OS you are using? I have access to a system with a `Intel Core i5-2500K CPU` that is running Arch Linux and I can't reproduce the behavior you described on that system. But I was also able to run the code on a system with a `Intel Core i7-2600` CPU that is running Ubuntu 10.04 and it has similar behavior when transitioning between 1024 and 1025 test cases. Unfortunately I can't run perf or install anything on that system. – Gabriel Southern May 19 '15 at 23:24
  • @GabrielSouthern, The TLB explanation seems plausible, but maybe you should look for other values to account for code/stack/system pages as well. Also, since you don't reproduce this on your SandyBridge, here's a possible culprit that could possibly explain this difference - http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ – Leeor May 19 '15 at 23:36
  • By the way, the reason I was asking for the memset assembly is that some libraries may implement an alternative flow for ranges beyond some arbitrary threshold, using for example rep stos. – Leeor May 19 '15 at 23:55
  • @Leeor I thought the TLB explanation made sense, but the data from running perf doesn't appear to support it. The code size of the binary is very small so it should easily fit in the i-cache, so don't think that will interfere with other cache levels. The blog post you linked to is interesting. There's obviously been changes in the CPU architecture, but it would be interesting to determine definitively what the cause of this behavior is. – Gabriel Southern May 19 '15 at 23:57
  • @Leeor, see my edit2 for the assembly. I probably put more than I needed to, but I can clean it up later. – hewy May 20 '15 at 01:16
  • @GabrielSouthern, About the TLB explanation, wouldn't 512*(4K) = 2M? Then I would hypothetically see the slowdown with an array of 2M? Which I think would then be taken care of by a "superpage" entry that I think is usually 2M/4M? I'm not sure I understand where you were going with that. Could you give a quick explanation? – hewy May 20 '15 at 04:57
  • @hewy you are right that 512*4K should be 2MB. I read somewhere that the size was 1MB, but that's obviously not correct. There might be thrashing based on how the addresses are assigned, but I'm not certain. Anyway I posted more details of what I know for now as an answer. – Gabriel Southern May 20 '15 at 05:28
  • related http://stackoverflow.com/a/26959685/332733 – Mgetz May 20 '15 at 17:26

3 Answers


Short Answer:

Your version of memset starts using non-temporal stores when initializing a region of memory larger than 1 MB. As a result, the CPU does not store these lines in its cache, even though your L3 cache is larger than 1 MB. Consequently, for buffer sizes larger than 1 MB, performance is limited by the system's available memory bandwidth.
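
To see this mechanism in isolation, here is a minimal sketch (my illustration, not code from the question) of the two store flavors the fast and slow paths boil down to: _mm_store_si128 compiles to movdqa (a regular cached store) and _mm_stream_si128 compiles to movntdq (a non-temporal store). Substituted for memset in the benchmark, the cached version should stay fast well past 1 MB while the streaming version should run at memory speed at every size. This assumes a 16-byte-aligned buffer and a size that is a multiple of 16, which glibc's malloc on x86-64 and the benchmark's sizes provide:

#include <emmintrin.h>  // SSE2 intrinsics; _mm_sfence comes in via xmmintrin.h
#include <stddef.h>

// Cached fill: each 16-byte store (movdqa) allocates its line in the cache.
void fill_cached(void *buf, size_t bytes)
{
    size_t i;
    __m128i v = _mm_set1_epi8((char)0xff);
    for (i = 0; i < bytes; i += 16)
        _mm_store_si128((__m128i *)((char *)buf + i), v);
}

// Non-temporal fill: each 16-byte store (movntdq) bypasses the cache hierarchy.
void fill_stream(void *buf, size_t bytes)
{
    size_t i;
    __m128i v = _mm_set1_epi8((char)0xff);
    for (i = 0; i < bytes; i += 16)
        _mm_stream_si128((__m128i *)((char *)buf + i), v);
    _mm_sfence();  // make the streaming stores visible before any later loads
}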

Details:

Background:

I tested the code you provided on several different systems and initially focused on investigating the TLB because I thought that there might be thrashing in the 2nd level TLB. However, none of the data I collected confirmed that hypothesis.

Some of the systems that I tested used Arch Linux, which has the latest version of glibc, while others used Ubuntu 10.04, which uses an older version of eglibc. I was able to reproduce the behavior described in the question with a statically linked binary on multiple different CPU architectures. The behavior I focused on was the significant difference in runtime between SIZE_KB of 1024 and 1025. The performance difference is explained by a change in the code executed for the slow and fast versions.

Assembly Code

I used perf record and perf annotate to collect a trace of the executing assembly code to see what the hot code path was. The code is displayed below using the following format:

percentage time executing instruction | address | instruction.

I've copied the hot loop in a shortened form that omits most of the addresses and keeps the line connecting the loop back edge to the loop header.

For the version compiled on Arch Linux the hot loop was (for both 1024 and 1025 sizes):

  2.35 │a0:┌─+movdqa %xmm8,(%rcx)
 54.90 │   │  movdqa %xmm8,0x10(%rcx)
 32.85 │   │  movdqa %xmm8,0x20(%rcx)
  1.73 │   │  movdqa %xmm8,0x30(%rcx)
  8.11 │   │  add    $0x40,%rcx      
  0.03 │   │  cmp    %rcx,%rdx       
       │   └──jne    a0

For the Ubuntu 10.04 binary the hot loop when running with a size of 1024 was:

       │a00:┌─+lea    -0x80(%r8),%r8
  0.01 │    │  cmp    $0x80,%r8     
  5.33 │    │  movdqa %xmm0,(%rdi)  
  4.67 │    │  movdqa %xmm0,0x10(%rdi)
  6.69 │    │  movdqa %xmm0,0x20(%rdi)
 31.23 │    │  movdqa %xmm0,0x30(%rdi)
 18.35 │    │  movdqa %xmm0,0x40(%rdi)
  0.27 │    │  movdqa %xmm0,0x50(%rdi)
  3.24 │    │  movdqa %xmm0,0x60(%rdi)
 16.36 │    │  movdqa %xmm0,0x70(%rdi)
 13.76 │    │  lea    0x80(%rdi),%rdi 
       │    └──jge    a00    

For the Ubuntu 10.04 version running with a buffer size of 1025 the hot loop was:

       │a60:┌─+lea    -0x80(%r8),%r8  
  0.15 │    │  cmp    $0x80,%r8       
  1.36 │    │  movntd %xmm0,(%rdi)    
  0.24 │    │  movntd %xmm0,0x10(%rdi)
  1.49 │    │  movntd %xmm0,0x20(%rdi)
 44.89 │    │  movntd %xmm0,0x30(%rdi)
  5.46 │    │  movntd %xmm0,0x40(%rdi)
  0.02 │    │  movntd %xmm0,0x50(%rdi)
  0.74 │    │  movntd %xmm0,0x60(%rdi)
 40.14 │    │  movntd %xmm0,0x70(%rdi)
  5.50 │    │  lea    0x80(%rdi),%rdi 
       │    └──jge    a60

The key difference here is that the slower version uses movntd (non-temporal store) instructions while the faster versions use movdqa (regular cached store) instructions. The Intel Software Developer's Manual says the following about non-temporal stores:

For WC memory type in particular, the processor never appears to read the data into the cache hierarchy. Instead, the non-temporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache.

So this seems to explain why regions larger than 1 MB written with memset never end up in the cache. The next question is why there is a difference between the Ubuntu 10.04 system and the Arch Linux system, and why 1 MB is selected as the cutoff point. To investigate that question I looked at the glibc source code:

Source code for memset

Looking at the glibc git repo at sysdeps/x86_64/memset.S, the first commit I found interesting was b2b671b677d92429a3d41bf451668f476aa267ed

The commit description is:

Faster memset on x64

This implementation speed up memset in several ways. First is avoiding expensive computed jump. Second is using fact that arguments of memset are most of time aligned to 8 bytes.

Benchmark results on: kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_result27_04_13.tar.bz2

And the website referenced has some interesting profiling data.

The diff of the commit shows that the code for memset is simplified a lot and the non-temporal stores are removed. This matches what the profiled code from Arch Linux shows.

Looking at the older code, I saw that the choice of whether to use non-temporal stores made use of a value described as "The largest cache size":

L(byte32sse2_pre):

    mov    __x86_shared_cache_size(%rip),%r9d  # The largest cache size
    cmp    %r9,%r8
    ja     L(sse2_nt_move_pre)

The code for calculating this is in: sysdeps/x86_64/cacheinfo.c

Although it looks like there is code for calculating the actual shared cache size, the default value is also 1 MB:

long int __x86_64_shared_cache_size attribute_hidden = 1024 * 1024;

So I suspect that either the default value is being used, or there is some other reason that the code selects 1 MB as the cutoff point.
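
As an aside, glibc exposes the cache sizes it detects through sysconf (a glibc-specific extension, not POSIX; this snippet is my own illustration, not part of the question's code), which is a quick way to see what values the library computed on a given machine:

#include <stdio.h>
#include <unistd.h>

// Prints the cache sizes glibc detected (0 or -1 when a level is unknown).
int main(void)
{
    printf("L2 = %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 = %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}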

In either case the overall answer to your question appears to be that the version of memset on your system is using non-temporal stores when setting a region of memory larger than 1 MB.

Gabriel Southern
  • I like this answer, but I'm not quite ready to accept it as is. I think the assembly you printed from gcc4.4 shows what is going on. In the 1025 version, movntd is a non-temporal store, meaning that the cache line containing that memory is not loaded into cache, and will not be available in cache for the next iteration. In both the fast versions (1024 and ArchLinux), movdqa is used, which causes the cache line to be loaded. So for some reason, on arrays larger than 1M, `memset` goes with non-temporal stores. I think now the question is why/how to fix on my machine and machines like it. – hewy May 20 '15 at 06:05
  • I think you are right about the non-temporal store too. I was looking for a microarchitecture explanation and I didn't look at the difference in the assembly that closely. I'll edit the answer tomorrow. – Gabriel Southern May 20 '15 at 06:14
  • Actually, to guess the answer to my own question, I bet `memset` uses nt stores after 1M because someone figured it wasn't worth killing 1M of cache with a huge call to memset. I bet you can fix it by writing your own memset, probably with intrinsics. Looking forward to seeing your edit. Thanks for helping. – hewy May 20 '15 at 06:27
  • @hewy I've edited my answer and I think this is a better explanation of what's happening (I posted my previous answer because I had some data but I wasn't really satisfied with my theory). Thanks for asking an interesting question, I learned a few things in the process of trying to answer it. – Gabriel Southern May 20 '15 at 16:41

Given Gabriel's disassembly of the generated assembly code, I think this is indeed the problem [Edit: his answer was edited, it now appears as the root cause so we're in agreement]:

Note that movnt is a streaming store, which may have (depending on the exact micro-architectural implementation) several impacts:

  1. Has weak ordering semantics (which allows it to be faster).
  2. Has improved latency if it overwrites a full line (no need to fetch previous data and merge).
  3. Has a non-temporal hint, making it uncacheable.

#1 and #2 may improve the latency and bandwidth of these operations if they're memory bound, but #3 basically forces them to be memory bound even when they could fit in some cache level. This probably outweighs the benefits, since the memory latency/BW are significantly worse to begin with.

So, your library's memset implementation is probably using the wrong threshold for switching to the streaming-store version (I guess it doesn't bother checking your LLC size, but assuming anything over 1M won't be cache resident is quite strange). I suggest trying alternative libraries, or disabling the compiler's ability to generate them (if that's supported).
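
For example, one low-effort way to try an alternative is to interpose a plain, always-cacheable memset with LD_PRELOAD. This is a rough sketch of mine (deliberately naive, not tested against your setup); compile it with -fno-builtin (and -fno-tree-loop-distribute-patterns on newer gcc) so the compiler doesn't recognize the loop and turn it back into a call to memset:

/* mymemset.c
 * build: gcc -O2 -fno-builtin -shared -fPIC -o mymemset.so mymemset.c
 * run:   LD_PRELOAD=./mymemset.so ./a.out
 */
#include <stddef.h>

void *memset(void *s, int c, size_t n)
{
    unsigned char *p = s;          /* plain byte stores: always cacheable */
    while (n--)
        *p++ = (unsigned char)c;
    return s;
}

A byte loop obviously won't reach movdqa speeds; a serious replacement would use 16-byte stores, but even this is enough to confirm whether the non-temporal threshold is the culprit.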

Leeor

Your benchmark only writes to memory, never reads, using memset, which is probably cleverly designed not to read anything into the cache first. It may very well be that with this code, which exercises only half of the cache's capability, there is just no performance gain compared to raw memory. The fact that writing to raw memory is quite close to the L2 speed may be a hint: if L2 runs at 26 GB/sec and main memory at 18 GB/sec, what can you really expect from the L3 cache?

You are measuring throughput, not latency. I'd try a benchmark where you actually use the strength of L3 cache, supplying data with lower latency than main memory.
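
Something along these lines could be dropped into the question's timing harness in place of memset (a sketch of mine, not tested): it sums the buffer so the loads can't be optimized away, exercising the cache as a data source rather than a sink. The returned sum must be consumed, e.g. printed, after the timed loop:

#include <stdint.h>
#include <stddef.h>

/* Sequential read kernel: with the buffer resident in L3, these loads should
 * be served noticeably faster than from main memory, which a write-only test
 * cannot show. */
uint64_t read_buffer(const uint64_t *buf, size_t n_words)
{
    uint64_t sum = 0;
    size_t i;
    for (i = 0; i < n_words; i++)
        sum += buf[i];
    return sum;
}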

gnasher729
  • The question is why is there a large performance difference between a buffer size of 1024 KB and 1025 KB. – Gabriel Southern May 20 '15 at 05:32
  • Here is my interpretation of the speeds: In L1, speed is determined by clock speed. The CPU can sustain one 16-byte write to L1 each cycle (Intel Optimization Manual). For me, that means the max write speed is between (3.4 GHz)*(16 bytes) = 54.4 GB/s and (3.9 GHz)*(16 bytes) = 62.4 GB/s. There is some overhead calling memset in the dynamically linked library, and I'm not sure what Intel's Turbo Boost does with my clock speed, so I am OK with that. The speed of L2 is then ~38 GB/s, L3 is ~26 GB/s, and main memory is ~18 GB/s. These speeds are limited by how fast memory can be loaded into L1. – hewy May 20 '15 at 06:11