Optimizing linear access to arrays with pre-fetching and cache in C

Question

disclosure: I've tried similar question on programmers.stack, but that place is nowhere near activity stack is.

Intro

I tend to work with lots of large images. They also come in sequences of more than one and have to be processed and played back repeatedly. Sometimes I use GPU, sometimes CPU, sometimes both. Most of access patterns are linear in nature (back and forth) which got me thinking about more basic things regarding arrays and how should one approach writing code optimized for maximum memory bandwidth possible on given hardware (permitting computation isn't blocking read/write).

Test specs

I've done this on a 2011 MacbookAir4,2 (I5-2557M) with 4GB RAM and SSD. Nothing else was running during tests except iterm2.
gcc 5.2.0 (homebrew) with flags: -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0 with additional include and library flags as well as framework flags in order to use glfw timer which I tend to use. I could've done it without, it doesn't matter. All 64-bit, of course.
I've tried tests with optional -fprefetch-loop-arrays flag, but it didn't seem to influence results at all

Test

Allocating two arrays of n bytes on the heap - where n is 8, 16, 32, 64, 128, 256, 512 and 1024 MB
Initialize array to 0xff, byte at a time
Test 1 - linear copy

linear copy:

for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }

Test 2 - copying with stride. This is where it gets confusing. I have tried playing the pre-fetch game here. I have tried various combinations of how much should I do per loop and it seems that ~40 per loop yields best performance. Why? I have no idea. I understand that malloc in c99 with uint64_t would give me memory aligned block. I also see sizes of my L1 through L3 caches, which are higher than these 320 bytes, so what am I hitting? Clues might be later in graphs. I would really like to understand this.

stride copy:

for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
            array_copy[i] = array[i];
            array_copy[i+1] = array[i+1];
            array_copy[i+2] = array[i+2];
            array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4];
            array_copy[i+5] = array[i+5];
            array_copy[i+6] = array[i+6];
            array_copy[i+7] = array[i+7];
            array_copy[i+8] = array[i+8];
            array_copy[i+9] = array[i+9];
            array_copy[i+10] = array[i+10];
            array_copy[i+11] = array[i+11];
            array_copy[i+12] = array[i+12];
            array_copy[i+13] = array[i+13];
            array_copy[i+14] = array[i+14];
            array_copy[i+15] = array[i+15];
            array_copy[i+16] = array[i+16];
            array_copy[i+17] = array[i+17];
            array_copy[i+18] = array[i+18];
            array_copy[i+19] = array[i+19];
            array_copy[i+20] = array[i+20];
            array_copy[i+21] = array[i+21];
            array_copy[i+22] = array[i+22];
            array_copy[i+23] = array[i+23];
            array_copy[i+24] = array[i+24];
            array_copy[i+25] = array[i+25];
            array_copy[i+26] = array[i+26];
            array_copy[i+27] = array[i+27];
            array_copy[i+28] = array[i+28];
            array_copy[i+29] = array[i+29];
            array_copy[i+30] = array[i+30];
            array_copy[i+31] = array[i+31];
            array_copy[i+32] = array[i+32];
            array_copy[i+33] = array[i+33];
            array_copy[i+34] = array[i+34];
            array_copy[i+35] = array[i+35];
            array_copy[i+36] = array[i+36];
            array_copy[i+37] = array[i+37];
            array_copy[i+38] = array[i+38];
            array_copy[i+39] = array[i+39];
    }

Test 3 - reading with stride. Same as with copying with stride.

stride read:

    const int imax = 1000;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
                tmp = array[i];
                tmp = array[i+1];
                tmp = array[i+2];
                tmp = array[i+3];
                tmp = array[i+4];
                tmp = array[i+5];
                tmp = array[i+6];
                tmp = array[i+7];
                tmp = array[i+8];
                tmp = array[i+9];
                tmp = array[i+10];
                tmp = array[i+11];
                tmp = array[i+12];
                tmp = array[i+13];
                tmp = array[i+14];
                tmp = array[i+15];
                tmp = array[i+16];
                tmp = array[i+17];
                tmp = array[i+18];
                tmp = array[i+19];
                tmp = array[i+20];
                tmp = array[i+21];
                tmp = array[i+22];
                tmp = array[i+23];
                tmp = array[i+24];
                tmp = array[i+25];
                tmp = array[i+26];
                tmp = array[i+27];
                tmp = array[i+28];
                tmp = array[i+29];
                tmp = array[i+30];
                tmp = array[i+31];
                tmp = array[i+32];
                tmp = array[i+33];
                tmp = array[i+34];
                tmp = array[i+35];
                tmp = array[i+36];
                tmp = array[i+37];
                tmp = array[i+38];
                tmp = array[i+39];
        }

Test 4 - Linear reading. Byte per byte. I was surprised -fprefetch-loop-arrays yielded no results here. I thought it was for these cases.

linear reading:

for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            tmp = array[i];
        }

Test 5 - memcpy as a contrast.

memcpy:

memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));

Results

Sample output:

sample output:

Init done in 0.767 s - size of array: 1024 MBs (x2)
Performance: 1304.325 MB/s

Copying (linear) done in 0.898 s
Performance: 1113.529 MB/s

Copying (stride 40) done in 0.257 s
Performance: 3890.608 MB/s

[1000/1000] Performance stride 40: 7474.322 MB/s
Average: 7523.427 MB/s
Performance MIN: 3231 MB/s | Performance MAX: 7818 MB/s

[1000/1000] Performance dumb: 2504.713 MB/s
Average: 2481.502 MB/s
Performance MIN: 1572 MB/s | Performance MAX: 2644 MB/s

Copying (memcpy) done in 1.726 s
Performance: 579.485 MB/s

--

Init done in 0.415 s - size of array: 512 MBs (x2)
Performance: 1233.136 MB/s

Copying (linear) done in 0.442 s
Performance: 1157.147 MB/s

Copying (stride 40) done in 0.116 s
Performance: 4399.606 MB/s

[1000/1000] Performance stride 40: 6527.004 MB/s
Average: 7166.458 MB/s
Performance MIN: 4359 MB/s | Performance MAX: 7787 MB/s

[1000/1000] Performance dumb: 2383.292 MB/s
Average: 2409.005 MB/s
Performance MIN: 1673 MB/s | Performance MAX: 2641 MB/s

Copying (memcpy) done in 0.102 s
Performance: 5026.476 MB/s

--

Init done in 0.228 s - size of array: 256 MBs (x2)
Performance: 1124.618 MB/s

Copying (linear) done in 0.242 s
Performance: 1057.916 MB/s

Copying (stride 40) done in 0.070 s
Performance: 3650.996 MB/s

[1000/1000] Performance stride 40: 7129.206 MB/s
Average: 7370.537 MB/s
Performance MIN: 4805 MB/s | Performance MAX: 7848 MB/s

[1000/1000] Performance dumb: 2456.129 MB/s
Average: 2435.556 MB/s
Performance MIN: 1496 MB/s | Performance MAX: 2637 MB/s

Copying (memcpy) done in 0.050 s
Performance: 5095.845 MB/s

-- 

Init done in 0.100 s - size of array: 128 MBs (x2)
Performance: 1277.200 MB/s

Copying (linear) done in 0.112 s
Performance: 1147.030 MB/s

Copying (stride 40) done in 0.029 s
Performance: 4424.513 MB/s

[1000/1000] Performance stride 40: 6497.635 MB/s
Average: 6714.540 MB/s
Performance MIN: 4206 MB/s | Performance MAX: 7843 MB/s

[1000/1000] Performance dumb: 2275.336 MB/s
Average: 2335.544 MB/s
Performance MIN: 1572 MB/s | Performance MAX: 2626 MB/s

Copying (memcpy) done in 0.025 s
Performance: 5086.502 MB/s

-- 

Init done in 0.051 s - size of array: 64 MBs (x2)
Performance: 1255.969 MB/s

Copying (linear) done in 0.058 s
Performance: 1104.282 MB/s

Copying (stride 40) done in 0.015 s
Performance: 4305.765 MB/s

[1000/1000] Performance stride 40: 7750.063 MB/s
Average: 7412.167 MB/s
Performance MIN: 3892 MB/s | Performance MAX: 7826 MB/s

[1000/1000] Performance dumb: 2610.136 MB/s
Average: 2577.313 MB/s
Performance MIN: 2126 MB/s | Performance MAX: 2652 MB/s

Copying (memcpy) done in 0.013 s
Performance: 4871.823 MB/s

-- 

Init done in 0.024 s - size of array: 32 MBs (x2)
Performance: 1306.738 MB/s

Copying (linear) done in 0.028 s
Performance: 1148.582 MB/s

Copying (stride 40) done in 0.008 s
Performance: 4265.907 MB/s

[1000/1000] Performance stride 40: 6181.040 MB/s
Average: 7124.592 MB/s
Performance MIN: 3480 MB/s | Performance MAX: 7777 MB/s

[1000/1000] Performance dumb: 2508.669 MB/s
Average: 2556.529 MB/s
Performance MIN: 1966 MB/s | Performance MAX: 2646 MB/s

Copying (memcpy) done in 0.007 s
Performance: 4617.860 MB/s

--

Init done in 0.013 s - size of array: 16 MBs (x2)
Performance: 1243.011 MB/s

Copying (linear) done in 0.014 s
Performance: 1139.362 MB/s

Copying (stride 40) done in 0.004 s
Performance: 4181.548 MB/s

[1000/1000] Performance stride 40: 6317.129 MB/s
Average: 7358.539 MB/s
Performance MIN: 5250 MB/s | Performance MAX: 7816 MB/s

[1000/1000] Performance dumb: 2529.707 MB/s
Average: 2525.783 MB/s
Performance MIN: 1823 MB/s | Performance MAX: 2634 MB/s

Copying (memcpy) done in 0.003 s
Performance: 5167.561 MB/s

--

Init done in 0.007 s - size of array: 8 MBs (x2)
Performance: 1186.019 MB/s

Copying (linear) done in 0.007 s
Performance: 1147.018 MB/s

Copying (stride 40) done in 0.002 s
Performance: 4157.658 MB/s

[1000/1000] Performance stride 40: 6958.839 MB/s
Average: 7097.742 MB/s
Performance MIN: 4278 MB/s | Performance MAX: 7499 MB/s

[1000/1000] Performance dumb: 2585.366 MB/s
Average: 2537.896 MB/s
Performance MIN: 2284 MB/s | Performance MAX: 2610 MB/s

Copying (memcpy) done in 0.002 s
Performance: 5059.164 MB/s

Linear reading is 3 times slower than reading in strides. Stride reading maxes out at approx. 7500-7800 MB/s range. Two things confuse me though. At DDR3 1333 Mhz, max memory throughput should be 10,664 MB/s so why am I not hitting it? Why reading speed isn't more consistent and how would I optimize for that (cache misses?)? It's more apparent from graphs, especially linear reading with regular dips in performance.

Graphs

8-16 MB

32-64 MB

128-256 MB

512-1024 MB

All together

Here's full source for anyone interested:

/*
gcc -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0 -I "...path to glfw3 includes ..." -L "...path to glfw3 lib ..." arr_test_copy_gnuplot.c -o arr_test_copy_gnuplot -lglfw3 -framework OpenGL -framework Cocoa -framework IOKit -framework CoreVideo

optional: -fprefetch-loop-arrays
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* memcpy */
#include <inttypes.h>
#include <GLFW/glfw3.h>

#define ARRAY_NUM 1000000 * 128 /* GIG */
int main(int argc, char *argv[]) {

    if(!glfwInit()) {
        exit(EXIT_FAILURE);
    }

    int cx = 0;
    char filename_stride[50];
    char filename_dumb[50];
    cx = snprintf(filename_stride, 50, "%lu_stride.dat", 
                    ((ARRAY_NUM*sizeof(uint64_t))/1000000));
    if(cx < 0 || cx >50) { exit(EXIT_FAILURE); }
    FILE *file_stride = fopen(filename_stride, "w");
    cx = snprintf(filename_dumb, 50, "%lu_dumb.dat", 
                    ((ARRAY_NUM*sizeof(uint64_t))/1000000));
    if(cx < 0 || cx >50) { exit(EXIT_FAILURE); }
    FILE *file_dumb   = fopen(filename_dumb, "w");
    if(file_stride == NULL || file_dumb == NULL) {
        perror("Error opening file.");
        exit(EXIT_FAILURE);
    }

    uint64_t *array = malloc(sizeof(uint64_t) * ARRAY_NUM);
    uint64_t *array_copy = malloc(sizeof(uint64_t) * ARRAY_NUM);

    double performance  = 0.0;
    double time_start   = 0.0;
    double time_end     = 0.0;
    double performance_min  = 0.0;
    double performance_max  = 0.0;

    /* Init array */
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array[i] = 0xff;
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Init done in %.3f s - size of array: %lu MBs (x2)\n", (time_end - time_start), (ARRAY_NUM*sizeof(uint64_t)/1000000));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Linear copy */
    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (linear) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Copying with wide stride */
    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
            array_copy[i] = array[i];
            array_copy[i+1] = array[i+1];
            array_copy[i+2] = array[i+2];
            array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4];
            array_copy[i+5] = array[i+5];
            array_copy[i+6] = array[i+6];
            array_copy[i+7] = array[i+7];
            array_copy[i+8] = array[i+8];
            array_copy[i+9] = array[i+9];
            array_copy[i+10] = array[i+10];
            array_copy[i+11] = array[i+11];
            array_copy[i+12] = array[i+12];
            array_copy[i+13] = array[i+13];
            array_copy[i+14] = array[i+14];
            array_copy[i+15] = array[i+15];
            array_copy[i+16] = array[i+16];
            array_copy[i+17] = array[i+17];
            array_copy[i+18] = array[i+18];
            array_copy[i+19] = array[i+19];
            array_copy[i+20] = array[i+20];
            array_copy[i+21] = array[i+21];
            array_copy[i+22] = array[i+22];
            array_copy[i+23] = array[i+23];
            array_copy[i+24] = array[i+24];
            array_copy[i+25] = array[i+25];
            array_copy[i+26] = array[i+26];
            array_copy[i+27] = array[i+27];
            array_copy[i+28] = array[i+28];
            array_copy[i+29] = array[i+29];
            array_copy[i+30] = array[i+30];
            array_copy[i+31] = array[i+31];
            array_copy[i+32] = array[i+32];
            array_copy[i+33] = array[i+33];
            array_copy[i+34] = array[i+34];
            array_copy[i+35] = array[i+35];
            array_copy[i+36] = array[i+36];
            array_copy[i+37] = array[i+37];
            array_copy[i+38] = array[i+38];
            array_copy[i+39] = array[i+39];
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (stride 40) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Reading with wide stride */
    const int imax = 1000;
    double performance_average = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
                tmp = array[i];
                tmp = array[i+1];
                tmp = array[i+2];
                tmp = array[i+3];
                tmp = array[i+4];
                tmp = array[i+5];
                tmp = array[i+6];
                tmp = array[i+7];
                tmp = array[i+8];
                tmp = array[i+9];
                tmp = array[i+10];
                tmp = array[i+11];
                tmp = array[i+12];
                tmp = array[i+13];
                tmp = array[i+14];
                tmp = array[i+15];
                tmp = array[i+16];
                tmp = array[i+17];
                tmp = array[i+18];
                tmp = array[i+19];
                tmp = array[i+20];
                tmp = array[i+21];
                tmp = array[i+22];
                tmp = array[i+23];
                tmp = array[i+24];
                tmp = array[i+25];
                tmp = array[i+26];
                tmp = array[i+27];
                tmp = array[i+28];
                tmp = array[i+29];
                tmp = array[i+30];
                tmp = array[i+31];
                tmp = array[i+32];
                tmp = array[i+33];
                tmp = array[i+34];
                tmp = array[i+35];
                tmp = array[i+36];
                tmp = array[i+37];
                tmp = array[i+38];
                tmp = array[i+39];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        if(performance > performance_max) { performance_max = performance; }
        if(j == 0) { performance_min = performance; }
        if(performance < performance_min) { performance_min = performance; }

        printf("[%d/%d] Performance stride 40: %.3f MB/s\r", j+1, imax, performance);
        fprintf(file_stride, "%d\t%f\n", j, performance);
        fflush(file_stride);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %.3f MB/s\n", performance_average);
    printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n", 
            performance_min, performance_max);

    /* Linear reading */
    performance_average = 0.0;
    performance_min     = 0.0;
    performance_max      = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            tmp = array[i];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        if(performance > performance_max) { performance_max = performance; }
        if(j == 0) { performance_min = performance; }
        if(performance < performance_min) { performance_min = performance; }
        printf("[%d/%d] Performance dumb: %.3f MB/s\r", j+1, imax, performance);
        fprintf(file_dumb, "%d\t%f\n", j, performance);
        fflush(file_dumb);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %.3f MB/s\n", performance_average);
    printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n", 
            performance_min, performance_max);

    /* Memcpy */
    performance = 0;
    time_start = glfwGetTime();
    memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (memcpy) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n", performance);

    /* Cleanup and exit */
    free(array);
    free(array_copy);
    glfwTerminate();
    fclose(file_dumb);
    fclose(file_stride);

    exit(EXIT_SUCCESS);
}

Summary

How should I write code to have maximum and (near)constant speed when working with arrays where linear access is most common pattern?
What can I learn about cache and pre-fetch from this example?
Are these graphs telling me something I should know that I haven't noticed?
How else can I unroll loops? I have tried -funroll-loops to no results, so I have resorted to manually writing loop-in-loop unrolls.

Thanks for the longs read.

EDIT:

It seems -O0 gives different performance from when -O flag is absent! What gives? Flag being absent yields better performance, as can be seen in the graph.

EDIT2:

I have finally hit the ceiling with AVX.

=== READING WITH AVX ===
[1000/1000] Performance AVX: 9868.912 MB/s
Average: 10029.085 MB/s
Performance MIN: 6554 MB/s | Performance MAX: 11464 MB/s

Average being really close to 10664. I had to change compiler to clang because gcc was giving me a hard time for using avx (-mavx). This is also why graph has more pronounced dips. I would still like to know how to / what is / have constant performance. I presume this is due to caching / cache lines. It would also explain performance going above DDR3 speed here and there (MAX was 11464 MB/s).

Excuse my gnuplot-fu and its keys. Blue is SSE2 ( _mm_load_si128 ) and orange is AVX ( _mm256_load_si256 ). Purple is strided as before and green is dumb reading one at a time.

So, final two questions are:

What is causing dips and how to have more constant performance
Is it possible to hit ceiling without intrinsics?

gist with latest version: https://gist.github.com/Keyframe/1ed9062ec52fc4a0d14b and graphs from that version: https://i.stack.imgur.com/LFZy3.jpg

In this case, turning on optimizations (-O3 or any other for that matter) yields weird results, skips code, etc. Turning them off completely I hope to get a clearer picture of what's going on once I dwell into asm output. — Keyframe, Nov 12 '15 at 19:49
If you don't want the compiler to reorder your memory accesses, mark the memory region you write to as `volatile`. — fuz, Nov 12 '15 at 19:51
@Keyframe: Without optimizations, you're not measuring anything worth mentioning. With `-O3` and `-march=native`, `gcc` will do stuff like use SIMD instructions when appropriate (not all the time, so GCC compiler intrinsics are sometimes useful to force it). With optimizations off, it won't do that, nor any other obvious improvements. `volatile` targets isn't really the answer either, since that can lead to unnecessary memory barriers on some architectures that completely destroy any attempts at performing a speedup. — ShadowRanger, Nov 12 '15 at 23:21
One simple approach you might take to ensure it actually does the work you're trying to time is to add, after the timed loop, a second loop that just xor sums the target array (which could be done using `volatile` access since the goal is to force gcc to do the work, not time it). — ShadowRanger, Nov 12 '15 at 23:23
@ShadowRanger I'm not sure about that. My goal is to learn how to write for cache and pre-fetching in order to understand it. Hence why no -03 because in this case it would just circumvent most of the stuff and/or not work. See my edits - I was able to hit max bandwidth, but via intrinsics. Now my goal is to eliminate dips in performance. Once I understand what needs to be going on in the background then I can rely on compiler optimizations. Not as a magic though, but as a tool to help once I already put my code in proper place. — Keyframe, Nov 13 '15 at 00:14
I hate to say this, but compiling without optimizations not the solution to, "-O3 is weird". If optimization is circumventing your stuff, you need to change your code so that it can't. That means using the results of the computation so that the compiler can't delete it. Or switching to a lower-level optimization that doesn't do crazy loop transformations that change your access pattern. When you compile without optimizations at all, it's possible that it will be so slow that it completely hides any differences in the memory access. — Mysticial, Nov 13 '15 at 00:26
I did try various optimization levels. I have also 'consumed' data. It didn't change the fact about what's happening. Example code here is a gist of what I'm trying to learn because in application code I had memory bandwidth issues, even with max optimizations from compiler. Manually unrolling a loop helped me with that and sparked mh interwst to learn more or in-depth. Hence why let's do it without compiler's help - because there weren't any. Also, -O0 vs no -O at all - what's up with that? Next time I will ask the same question, but in assembly and with asm example. — Keyframe, Nov 13 '15 at 01:14

score 7 · Accepted Answer · edited May 23 '17 at 11:54

7

Your value for the peak bandwidth from main memory is off by a factor of two. Instead of it 10664 MB/s it should be 21.3 GB/s (more precisely it should be (21333⅓) MB/s - see my derivation below). The fact that you see more than 10664 MB/s sometimes should have told you that maybe there was a problem in your peak bandwidth calculation.

In order to get the maximum bandwidth for Core2 through Sandy Bridge you need to use non-temporal stores. Additionally, you need multiple threads. You don't need AVX instructions or to unroll the loop.

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_stream_ps((float*)&y[16*i], _mm_load_ps((float*)&x[16*i]));
    }
}

The arrays need to be 16 byte aligned and also be a multiple of 16. The rule of thumb for non-temporal stores is to use them when the memory you are copying is larger than half the size of last level cache. In your case half the L3 cache size is 1.5 MB and the smallest array you copy is 8 MB so this is much larger than half the last level cache size.

Here is some code to test this.

//gcc -O3 -fopenmp foo.c
#include <stdio.h>
#include <x86intrin.h>
#include <string.h>
#include <omp.h>

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_stream_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}

void copy2(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_store_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}

int main(void)
{
    unsigned n = 0x7fffffff;
    char *x = _mm_malloc(n, 16);
    char *y = _mm_malloc(n, 16);
    double dtime;

    memset(x,0,n);
    memset(y,1,n);

    dtime = -omp_get_wtime();
    copy(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    copy2(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    memcpy(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);  
}

On my system, Core2 (before Nehalem) P9600@2.53GHz, it gives

time non temporal store 0.39
time SSE store          1.10
time memcpy             0.98

to copy 2GB.

Note that it's very important that you "touch" the memory you will write to first (I used memset to do this). Your system does not necessarily allocate your memory until you access it. The overhead to do this can bias your results significantly if the memory has not been accesses when you do the memory copy.

According to wikipedia DDR3-1333 has a memory clock of 166⅔ MHz. DDR transfers data at twice memory clock rate. Additionally, DDR3 has a bus clock multiplier of four. So DDR3 has a total multiply per memory clock of eight. Additionally, your motherboard has two memory channels. So the total transfer rate is

 21333⅓ MB/s = (166⅔ 1E6 clocks/s) * (8 lines/clock/channel) * (2 channels) * (64-bits/line) * (byte/8-bits) * (MB/1E6 bytes).

edited May 23 '17 at 11:54

Community

1
1

answered Nov 15 '15 at 11:09

Z boson

32,619
11
123
226

Alright, this is a type of answer I was hoping for, to learn from! How did you arrive at 21328 MB/s? I did a DDR3 frequency (1333) times 8. Somehow, I presumed larger values were coming from cache speeds. – Keyframe Nov 15 '15 at 17:32
@Keyframe, I looked it up. It's in the first link in my answer "Maximum memory bandwidth (GB/s): 21.3". I think the extra factor of 2 comes from the fact that your computer is dual-channel (some higher end chipsets and motherboards support triple and quad channel). – Z boson Nov 15 '15 at 18:57
1

@Keyframe, the formula is `((166+2/3)*1E6 clocks/sec)*(8 lines/clock/channel)*(2channels)*(64-bits/line)*(byte/8-bits) = 21333 1E6 bytes/sec` (notice that the dimensional analysis works out correctly). So I guess my number is slightly off. I wrote 21328 MB/sec. A slightly more acurate value is 21333 MB/s (the data rate of your DDR3 memory is really `1333+1/3 = (166+2/3)*8`). – Z boson Nov 15 '15 at 19:15
Damn, @Z boson - it will take me some time to decipher that formula :) Where does (166+2/3) come from and clocks/sec, lines/clock/channel values are? I thought 21.3 was coming from a speed of a memory controller, not actual memory that it interfaces with (DDR3). Dual channel makes sense then, even with trivial calculation of (1333 MhZ * 8 bits) * 2 channels. – Keyframe Nov 15 '15 at 20:18
@Keyframe, I added some text to my answer to explain the bandwidth calculation. I also fixed the code to work when the number of threads is not a power of 2 (e.g six threads). On your system OpenMP should default to 4 threads but I wanted my answer to be okay in general. – Z boson Nov 16 '15 at 10:04
@Keyframe, I need to correct my last comment. What I actually did was fix the code so that each thread operates on a multiple of 16 elements. This is because the stream instruction needs to be 16 byte aligned each thread must operate on 16 elements to stay aligned. – Z boson Nov 17 '15 at 08:07
I figured that on my own prior to your edit. Imagine that! :) Thanks again! – Keyframe Nov 17 '15 at 11:23
@Keyframe, small mistake again...I keep being inaccurate. I mean "each thread must operate on a *multiple* of 16 elements to stay aligned." I'm much better at writing code than at writing words. Note that the load does not need to be aligned. You could use`_mm_loadu_ps`. – Z boson Nov 17 '15 at 11:36
@Keyframe, btw, I really appreciate your plots. If you make a new plot using the non-temporal stores from my answer could you add that plot to your question please? Have you tested this? I think it should get you within about 80% of the peak. Maybe 18 GB/s. – Z boson Nov 17 '15 at 11:40
I'm not near my laptop or workstation now to make a graph, but I will update. There's a full source on gist in my last edit with commented gitplot commands to produce those graphs. I did an initial test when you posted, but performance was actually worse. I will do a test without openmp and with tinycthreads, which I tend to use and without ps. I think there's a streaming alternative for integers. If not, then with ps. I wont use unaligned load. Maybe for another test run. I also got some feedback out of stackoverflow. – Keyframe Nov 17 '15 at 12:43
@Keyframe, I updated my answer with code to test the performance. On the system I tested this on using non-temporal stores was 2.5 times faster than `memcpy`. – Z boson Nov 21 '15 at 15:32

score 3 · Answer 2 · answered Nov 12 '15 at 17:05

3

For the things you are doing I would look at SIMD (Single Instruction Multiple Data), google for GCC Compiler Intrinsics for details

answered Nov 12 '15 at 17:05

Marek Waszkiewicz

305
2
10

That's a reasonable approach. I forgot to mention that I haven't tried that with this, because I thought that somehow magically one could maximie throughput without special instructions? – Keyframe Nov 12 '15 at 17:06
I think your answer should really have been left as a comment. The throughput for memory copy for sizes larger than the last level cache is memory bandwidth bound. The memory reads are much slower than the CPU so SIMD is not going to help much (which is why AVX and unrolling are unnecessary). The only reason I used SSE in my answer was to get access to the non-temporal store instructions. – Z boson Nov 15 '15 at 15:56

score 3 · Answer 3 · edited May 23 '17 at 12:00

3

You should compile with a recent GCC (so having compiled your GCC 5.2 is a good idea, in November 2015), and you should enable optimizations for your particular platform, so I suggest compiling with gcc -Wall -O2 -march=native at least (try also to replace -O2 with -O3).

^{(Don't benchmark your programs without enabling optimizations in your compiler)}

If you are concerned with cache effects, you might play with __builtin_prefetch, but see this.

Read also about OpenMP, OpenCL, OpenACC.

edited May 23 '17 at 12:00

Community

1
1

answered Nov 15 '15 at 15:44

Basile Starynkevitch

223,805
18
296
547

Of course the OP should have using at least `-O2`. But GCC's built in `memcpy` and glibc's `memcpy` do not use non-temporal stores and so they don't get anywhere close to the peak bandwidth. The complier won't use non-temporal stores unless you explicitly tell it to. – Z boson Nov 15 '15 at 16:29

Optimizing linear access to arrays with pre-fetching and cache in C

3 Answers3

Linked