
I'm trying to measure the write bandwidth of my memory. I created an 8 GB char array and call memset on it with 128 threads. Below is the code snippet.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>   /* for int64_t and uint8_t */
#include <time.h>
#include <string.h>
#include <pthread.h>
int64_t char_num = 8000000000;   /* total number of bytes written (8 GB) */
int threads = 128;
int64_t res_num = 62500000;      /* bytes per thread: 8000000000 / 128 */

uint8_t* arr;

static inline double timespec_to_sec(struct timespec t)
{
    return t.tv_sec * 1.0 + t.tv_nsec / 1000000000.0;
}

void* multithread_memset(void* val) {
    int thread_id = *(int*)val;
    memset(arr + (res_num * thread_id), 1, res_num);
    return NULL;
}

void start_parallel()
{
    int* thread_id = malloc(sizeof(int) * threads);
    for (int i = 0; i < threads; i++) {
        thread_id[i] = i;
    }
    pthread_t* thread_array = malloc(sizeof(pthread_t) * threads);
    for (int i = 0; i < threads; i++) {
        pthread_create(&thread_array[i], NULL, multithread_memset, &thread_id[i]);
    }
    for (int i = 0; i < threads; i++) {
        pthread_join(thread_array[i], NULL);
    }
}

int main(int argc, char *argv[])
{
    struct timespec before;
    struct timespec after;
    double time = 0;
    arr = malloc(char_num);

    clock_gettime(CLOCK_MONOTONIC, &before);
    start_parallel();
    clock_gettime(CLOCK_MONOTONIC, &after);
    double before_time = timespec_to_sec(before);
    double after_time = timespec_to_sec(after);
    time = after_time - before_time;
    printf("sequential = %10.8f\n", time);
    return 0;
}

According to the output, it took 0.6 seconds to finish all the memsets, which to my understanding implies a memory write bandwidth of 8 GB / 0.6 s ≈ 13 GB/s. However, I have 2667 MHz DDR4, which should have a 21.3 GB/s bandwidth. Is there anything wrong with my code or my calculation? Thanks for any help!!

Jerry
  • You are assuming that all threads run on different CPUs and that all threads are CPU bound. But also, you have provided only one decimal place of precision, so 0.6 might be anything from 0.550 to 0.649, i.e. anything between 12.3 GB/s and 14.5 GB/s. Measuring to only one decimal place thus gives over 2 GB/s of variation. – Cheatah Sep 28 '21 at 00:09
  • For one thing, `memset` won't do only write cycles. The first write instruction in each cache line will necessarily read that line into cache, because the CPU doesn't know you will later overwrite all of it. – Nate Eldredge Sep 28 '21 at 01:02
  • Also, 128 threads is a lot, unless you have 128 cores. The time spent context switching between them may be significant. – Nate Eldredge Sep 28 '21 at 01:05
  • 8e9 is not 8G. 8G is 8*1024*1024*1024 – 0___________ Sep 28 '21 at 01:18
  • If you want to prevent reading the cache line into the CPU cache, you may want to take a look at [non-temporal writes](https://stackoverflow.com/q/37070/12149471). You don't have to write assembler code for this. You can also use [compiler intrinsics](https://software.intel.com/sites/landingpage/IntrinsicsGuide/). – Andreas Wenzel Sep 28 '21 at 01:45
  • Thanks for the replies, so is there any simple way to test the memory write bandwidth? I was thinking about using memcpy, but that would require a lot of read operations as well. – Jerry Sep 28 '21 at 02:45
  • @0___________ 8 GB is 8,000,000,000 bytes according to the international system (the SI, not the IEC). This is especially true when no explicit unit is written. For bytes, not everyone agrees with that, which makes it a confusing unit. 8 GiB is 8*1024*1024*1024 bytes, and everyone agrees on this (although not everyone wants to use the notation, especially hardware vendors, for obvious marketing reasons). See [this](https://en.wikipedia.org/wiki/Binary_prefix) for more information. – Jérôme Richard Sep 28 '21 at 11:25
  • @Jerry Note that the frequency of your RAM may not match that of your processor. Note also that doing the memory write only once is not great because of frequency scaling and other possible issues. – Jérôme Richard Sep 28 '21 at 11:31

1 Answer


TL;DR: Measuring the memory bandwidth is not easy. In your case, the performance problem probably comes from page faults.

If you want to measure the memory write bandwidth, you need to be careful about multiple things:

  • On Intel/AMD x86 platforms, a memory write to a location that is not already in the cache causes a write allocation: the data at the missed-write location is loaded into the cache. See this page for more information. This strategy enables the processor to fill the part of the cache line that is not written, in order to keep the CPU caches consistent. However, it also means that half of the memory throughput is "wasted". In practice the situation is even worse, because the interleaved memory reads and writes often introduce additional overhead. One solution is to use non-temporal write instructions. With SSE you can use the _mm_stream_* intrinsics (typically _mm_stream_si128); with AVX, the _mm256_stream_* intrinsics (typically _mm256_stream_si256). Note that such instructions are worth using only if the data chunks do not fit in the cache or are not reused soon afterwards (a minimal sketch using the AVX intrinsic is shown just after this list). A good libc implementation should already use such instructions in memset and memcpy for big chunks.

  • Most operating systems do not actually map the allocated pages to physical ones at allocation time: the memory is only virtually allocated, not physically. The first touch of an allocated memory page causes a page fault, which is pretty expensive. The full page is generally physically mapped at that time and, on most systems, it is zeroed for security reasons. To measure the memory throughput, you must not include this overhead in the benchmark: pre-allocate the memory and write into the memory chunks ahead of time (if possible with random values), as done in the OpenMP benchmark sketch at the end of this answer.

  • CPU caches can be quite big; the memory buffer being written should be much larger than them so that you do not measure the throughput of the caches themselves (cache associativity typically matters here).

  • One thread is often not enough to saturate the bandwidth of the main memory. A few are usually needed to reach the optimal throughput (this is very dependent on the platform; many threads are often needed on server processors like Intel Xeons). With too many threads, complex effects (eg. contention) can appear and reduce the overall throughput.

  • On NUMA systems, memory accesses are generally faster when a core accesses its own memory. This means that threads should be pinned to cores and should read/write into a buffer dedicated to each thread to achieve the best throughput. This is especially true on AMD Ryzen desktop/server processors or on bi-socket server systems, for example.

  • Modern processors often use a variable frequency (see frequency scaling). Moreover, threads can take some time to create and actually start. Thus, it is important to use a loop iterating multiple times over the same buffer, with a synchronization barrier between iterations, to minimize the bias introduced by these effects. It is also important to check that the time taken by each thread is approximately the same (otherwise some unwanted effect, such as a NUMA one, is happening).

  • The amount of memory used should not be too big either, since some operating systems use a memory-compression strategy (eg. z-swap) to limit memory usage; in the worst case, a swap device can end up being used.
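
Regarding the first point, here is a minimal sketch of a non-temporal fill based on the AVX intrinsic mentioned above. The function name is mine, and it assumes the destination pointer is 32-byte aligned and the size is a multiple of 32 bytes; a real implementation would handle the unaligned head and tail with regular stores.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Fill `size` bytes at `dst` with `value` using streaming (non-temporal)
 * stores, so that the target cache lines are not read into the cache first.
 * Assumes: dst is 32-byte aligned and size is a multiple of 32. */
static void fill_stream(uint8_t* dst, uint8_t value, size_t size)
{
    __m256i v = _mm256_set1_epi8((char)value);
    for (size_t i = 0; i < size; i += 32)
        _mm256_stream_si256((__m256i*)(dst + i), v);
    _mm_sfence();  /* make the streaming stores globally visible before continuing */
}

It needs to be compiled with AVX enabled (eg. -mavx with GCC/Clang).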

Note that you can use OpenMP to write the parallel code more easily (the resulting code will be smaller and easier to read). OpenMP also lets you control thread pinning and create the right number of threads for the target architecture. OpenMP is supported by most compilers, including GCC, Clang, ICC and MSVC (only version 2.0 of the standard for MSVC so far).
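
To illustrate these points, here is a minimal sketch of such a benchmark written with OpenMP. It is not a definitive implementation: the buffer size, the number of repetitions and the choice of reporting the best run are my own. The pages are touched once in parallel before timing (so the page faults and the NUMA first-touch placement are not measured), and the parallel fill is repeated several times, with the implicit barrier at the end of each parallel region acting as the synchronization point.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define BUFFER_SIZE (8ull * 1024 * 1024 * 1024)  /* 8 GiB: much larger than the CPU caches */
#define REPEAT 10                                /* several runs to reduce frequency-scaling bias */

int main(void)
{
    unsigned char* arr = malloc(BUFFER_SIZE);
    if (arr == NULL) {
        perror("malloc");
        return 1;
    }

    /* First touch in parallel: physically map the pages before timing so the
       page faults are not measured and pages land on the right NUMA node. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < BUFFER_SIZE; i++)
        arr[i] = (unsigned char)i;

    double best = 0.0;
    for (int r = 0; r < REPEAT; r++) {
        double start = omp_get_wtime();

        /* Each thread writes its own contiguous chunk; the implicit barrier at
           the end of the parallel region synchronizes all threads. */
        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            int n = omp_get_num_threads();
            size_t chunk = BUFFER_SIZE / (size_t)n;
            size_t begin = (size_t)t * chunk;
            size_t size = (t == n - 1) ? BUFFER_SIZE - begin : chunk;
            memset(arr + begin, 1, size);
        }

        double elapsed = omp_get_wtime() - start;
        double gbps = BUFFER_SIZE / elapsed / 1e9;
        if (gbps > best)
            best = gbps;
        printf("run %d: %.3f s  (%.2f GB/s)\n", r, elapsed, gbps);
    }
    printf("best: %.2f GB/s\n", best);

    free(arr);
    return 0;
}

Compile with something like gcc -O2 -fopenmp bench.c -o bench, and pin the threads with the OMP_PROC_BIND=true and OMP_PLACES=cores environment variables (OMP_NUM_THREADS controls the number of threads). Note that, because of the write-allocate effect described above, the reported number still includes the implicit reads unless the memset implementation uses non-temporal stores.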

Jérôme Richard