
I was going through this link, which presents statistical data for the latencies of main memory and of the L1 and L2 caches.

I was wondering: is it possible to measure the same latencies with C/C++ code, without using benchmark tools?

    @DumbCoder: Memory usage is not latency. – Dietrich Epp Nov 28 '12 at 11:31
  • I am sorry, I am not a Windows guy. – user1510753 Nov 28 '12 at 11:32
  • Coding up your own benchmarks requires detailed knowledge of the specific hardware architecture(s) that you are targeting. At the very least you might want to specify the architectures you're interested in. – NPE Nov 28 '12 at 11:33
  • @Dietrich Epp - I saw L1 and L2 cache and saw memory and posted it. Skimmed the whole question. – DumbCoder Nov 28 '12 at 11:38
  • @NPE, yes, I am aware of the challenges, but I was looking for asm code which can help obtain the result. I am interested in Intel architectures: i3, i5, i7. – user1510753 Nov 28 '12 at 11:50

1 Answer


The benchmark tools, like LMBench, are written in C. So when you ask if it can be done in C, the answer is quite simply, "yes".

LMBench tests memory latency (in lat_mem_rd.c) by doing repeated pointer indirections. This is the same thing as following a linked list, except there is no content in the list, just a pointer to the next cell.

struct cell { struct cell *next; };

struct cell *ptr = ...;  /* head of the chain */
for (i = 0; i < count; i++) {
    ptr = ptr->next;
    ptr = ptr->next;
    /* ... 100 of these, unrolled ... */
    ptr = ptr->next;
    ptr = ptr->next;
}

By adjusting the size of the list, you can control whether the memory accesses hit the L1 cache, the L2 cache, or main memory. If you are testing L2 cache or main memory, however, you will need to ensure that each cache line you access has been evicted from the faster caches by the time you access it again. Some caches also support prefetching, so a "strided" access pattern may still hit a faster cache for certain strides.
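
As an illustration (a minimal sketch, not LMBench's actual code; the cell padding, the make_chain name, and the 64-byte line size are my assumptions), you can build the chain as a random permutation over a buffer of adjustable size, so that hardware prefetchers cannot predict the next access:

#include <stdlib.h>

/* Pad each cell to a full cache line (64 bytes assumed) so that
 * every dereference touches a distinct line. */
struct cell {
    struct cell *next;
    char pad[64 - sizeof(struct cell *)];
};

/* Link n cells in a random cyclic order (Fisher-Yates shuffle of
 * the indices) and return the head of the chain. */
struct cell *make_chain(struct cell *cells, size_t n)
{
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i + 1 < n; i++)
        cells[order[i]].next = &cells[order[i + 1]];
    cells[order[n - 1]].next = &cells[order[0]];  /* close the cycle */
    struct cell *head = &cells[order[0]];
    free(order);
    return head;
}

Sizing the buffer at, say, 16 KiB should target a typical L1 data cache, a few hundred KiB should target L2, and tens of MiB should spill to main memory; the exact thresholds depend on the CPU.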

You will also need to be sure to enable optimizations (-O2, with GCC/Clang). Otherwise ptr may get stored on the stack, increasing the latency. Finally, you will need to make sure that the compiler does not consider ptr to be a "dead" variable. A sophisticated compiler might notice that the above code doesn't actually do anything. Sometimes when writing benchmarks, the compiler is the enemy. The LMBench code has a function use_pointer() just for this purpose.
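
As a hedged sketch of the same idea (not LMBench's code: measure_ns_per_access and the volatile sink are names I made up, and it assumes the padded struct cell and chain from the sketch above; LMBench's use_pointer() plays the same role as the sink), the loop can be timed with POSIX clock_gettime, with the final pointer stored somewhere the compiler must assume is observable:

#include <time.h>

static volatile void *sink;  /* the final ptr "escapes" here, so the
                                loop is not dead code under -O2 */

double measure_ns_per_access(struct cell *head, long count)
{
    struct cell *ptr = head;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < count; i++) {
        ptr = ptr->next;
        ptr = ptr->next;
        ptr = ptr->next;
        ptr = ptr->next;  /* unroll much further in real code */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = ptr;
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    return ns / (4.0 * count);  /* 4 dereferences per iteration */
}

Because each load depends on the result of the previous one, the compiler cannot reorder or elide the dereferences even at -O2; the volatile store only guards against the whole loop being removed as dead code.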

Dietrich Epp
  • Very interesting. However, I am not sure I understand your code snippet: is the loop supposed to do the 100 `ptr=ptr->next` statements `count` times? Couldn't you do that by just looping `100*count` times with only one `ptr=ptr->next` in the loop body? – Zane Nov 28 '12 at 12:56
  • @Zane: The overhead of incrementing the loop variable, performing the comparison, and branching may affect the loop's runtime. As much as possible, we want to measure memory latency and not branch performance. Unrolling the loop 100x means that any error introduced by branch latency will be 100x smaller. On some processors (with good branch prediction and enough superscalar units) this may make no difference, but on some processors this will make a big difference. – Dietrich Epp Nov 28 '12 at 13:09
  • This is what I was trying to understand: unrolling the statements makes the difference. Thanks – Zane Nov 28 '12 at 16:39
  • @DietrichEpp for the stack allocation problem of ptr, one may add the `register` qualifier to the ptr variable, compile with -O0 to avoid the dead-variable problem, and check the generated assembly to be sure ptr is allocated to a register. – Manuel Selva Feb 06 '14 at 12:31
  • I just checked, and register allocation is done in lat_mem_rd.c https://github.com/foss-for-synopsys-dwc-arc-processors/lmbench/blob/master/src/lat_mem_rd.c – Manuel Selva Feb 06 '14 at 12:56
  • @ManuelSelva: I cannot recommend using the `register` qualifier, since compilers are free to ignore it (and in practice, often do). – Dietrich Epp Feb 06 '14 at 17:22