
After reading "What Every Programmer Should Know About Memory" (https://people.freebsd.org/~lstewart/articles/cpumemory.pdf), I wanted to try one of the author's tests, namely measuring the effects of the TLB on the final execution time.

I am working on a Samsung Galaxy S3 that embeds a Cortex-A9.

According to the documentation:

  • there are separate micro TLBs for data and for instructions;
  • the main TLB is located in L2.

I wrote a small program that allocates an array of structs with N entries. Each entry is 32 bytes, so it fits in a cache line. I perform several read accesses and measure the execution time.

typedef struct {
     int elmt; // sizeof(int) == 4 bytes
     char padding[28]; // 4 + 28 = 32B == cache line size
} entry;


volatile entry ** entries = NULL;

//Allocate memory and init to 0
entries = calloc(NB_ENTRIES, sizeof(entry *));
if(entries == NULL) perror("calloc failed"); exit(1);

for(i = 0; i < NB_ENTRIES; i++)
{
      entries[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);
}

entries[LAST_ELEMENT]->elmt = -1

//Randomly access and init with random values
n = -1;
i = 0;
while(++n < NB_ENTRIES -1)
{
      //init with random value
      entries[i]->elmt = rand() % NB_ENTRIES;

      //loop till we reach the last element
      while(entries[entries[i]->elmt]->elmt != -1)
      {
            entries[i]->elmt++;
            if(entries[i]->elmt == NB_ENTRIES)
                     entries[i]->elmt = 0;
      }

       i = entries[i]->elmt;
}


gettimeofday(&tStart, NULL);
for(i = 0; i < NB_LOOPS; i++)
{
     j = 0;
     while(j != -1)
     {
          j = entries[j]->elmt
     }
}
gettimeofday(&tEnd, NULL);

time = (tEnd.tv_sec - tStart.tv_sec);
time *= 1000000;
time += tEnd.tv_usec - tStart.tv_usec;
time *= 100000
time /= (NB_ENTRIES * NBLOOPS);

fprintf(stdout, "%d %3lld.%02lld\n", NB_ENTRIES, time / 100, time % 100);

I have an outer loop that makes NB_ENTRIES vary from 4 to 1024.

As one can see in the figure below, when NB_ENTRIES == 256 the execution time gets longer.

When NB_ENTRIES == 404 I get an "out of memory" error (why? Micro TLBs exceeded? Main TLB exceeded? Page tables exceeded? Virtual memory for the process exceeded?)

Can someone please explain what is really going on from 4 to 256 entries, and then from 257 to 404 entries?

Effects of TLB on execution time
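(Side note: my outer loop, which is not shown above, never releases the mappings between sizes. If that matters for the "out of memory", the teardown I would need looks roughly like the sketch below, reusing the question's variables; NB_ENTRIES is the size of the run that just finished.)

//Release all mappings of the finished run, then the pointer array
for(i = 0; i < NB_ENTRIES; i++)
{
      // each entry was obtained by mmap'ing one 4KB page
      if(munmap((void *)entries[i], 4096) == -1)
      {
            perror("munmap failed");
            exit(1);
      }
}
free(entries);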

EDIT 1

As suggested, I ran membench (src code) and the results are below:

membench results

EDIT 2

In the following paper (page 3) they ran what I suppose is the same benchmark. But the different steps are clearly visible in their plots, which is not the case for mine.

plots from the paper

Right now, according to their results and explanations, I can only identify a few things.

  • The plots confirm that the L1 cache line size is 32 bytes because, as they said:

"once the array size exceeds the size of the data cache (32KB), the reads begin to generate misses [...] an inflection point occurs when every read generates a miss".

In my case the very first inflection point appears when stride == 32 bytes.

  • The graph shows that we have a second-level (L2) cache. I think it is depicted by the yellow line (1MB == L2 size).
  • Therefore the two last plots above the latter probably reflect the latency of accessing main memory (+ TLB?).
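(For reference, here is my understanding of what this kind of benchmark boils down to; a sketch assuming membench's inner loop is a strided read, which its source should confirm:)

#include <stddef.h>

// Read one byte every 'stride' bytes over a working set of 'size'
// bytes, repeating 'reps' times so the run is long enough to time.
// The volatile sink keeps the compiler from optimizing the reads away.
static void stride_reads(const char *array, size_t size, size_t stride, long reps)
{
    volatile char sink;
    long r;
    size_t i;
    for(r = 0; r < reps; r++)
    {
        for(i = 0; i < size; i += stride)
        {
            sink = array[i];
        }
    }
    (void)sink;
}

Timing this for every (size, stride) pair and plotting the average latency per read gives curves like the ones above: each jump corresponds to one level of the memory hierarchy being exceeded.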

However, from this benchmark, I am not able to identify:

  • the cache associativity. Normally D-Cache and I-Cache are 4-way associative (Cortex-A9 TRM).
  • The TLB effects. As they said,

"in most systems, a secondary increase in latency is indicative of the TLB, which caches a limited number of virtual to physical translations. [...] The absence of a rise in latency attributable to the TLB indicates that [...]"

large page sizes have probably been used/implemented.
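(One way to verify on Linux whether large pages, i.e. transparent hugepages, are actually in use is to look at the AnonHugePages entry in /proc/meminfo; a quick sketch:)

#include <stdio.h>
#include <string.h>

// Print the AnonHugePages line from /proc/meminfo.
// A non-zero value means transparent hugepages are in use.
int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/meminfo", "r");
    if(f == NULL)
    {
        perror("fopen /proc/meminfo");
        return 1;
    }
    while(fgets(line, sizeof line, f) != NULL)
    {
        if(strncmp(line, "AnonHugePages:", 14) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}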

EDIT 3

This link explains the TLB effects using another membench graph. One can actually see the same effects on my graph.

On a 4KB page system, as you grow your strides, while they're still < 4K, you'll enjoy less and less utilization of each page [...] you'll have to access the 2nd level TLB on each access [...]

The Cortex-A9 supports 4KB pages. Indeed, as one can see in my graph, latencies increase up to stride == 4K; then, when the stride reaches 4K,

you suddenly start benefiting again since you're actually skipping whole pages.
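(To make the page-utilization argument concrete, here is a small sketch that computes, for a fixed number of reads, how many distinct 4KB pages are touched at each stride; the numbers are illustrative, not measured:)

#include <stdio.h>

// For 'reads' strided accesses starting at offset 0, count the
// distinct 4KB pages touched. Small strides reuse each page (and its
// TLB entry) many times; strides >= 4096 touch a new page on every
// single read.
int main(void)
{
    const long page = 4096, reads = 1024;
    long stride;
    for(stride = 32; stride <= 16384; stride *= 2)
    {
        long span = stride * reads;
        long pages = (span + page - 1) / page;
        if(pages > reads)
            pages = reads; // at most one page per read
        printf("stride %5ld B -> %4ld pages, %6.1f reads/page\n",
               stride, pages, (double)reads / pages);
    }
    return 0;
}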

  • Did you forget to unmap your entries? That would certainly explain the out of memory. – ElderBug May 29 '15 at 13:42
  • Hum, actually yep. But if I do not unmap the entries, what memory do I run out of? The allocated virtual space for my process? – D4l3k May 29 '15 at 13:58
  • Yes. The CPU isn't supposed to have memory of its own that you can run out of; its usage is transparent. So you can only run out of virtual memory, which is often related to physical RAM and swap space. BTW, you should include the code for `InitWithRandomValues`, since it seems to define how you access the memory. – ElderBug May 29 '15 at 14:05
  • I didn't inspect your code, but at best you are measuring the effect of cache size/cache line size. The TLB is another issue altogether. I don't know how you could possibly measure the effect it has, other than having identical hardware, but without a TLB. – bolov May 29 '15 at 14:20
  • @bolov flushing the TLB before each access would probably do, but I cannot see that in the posted code either. – Jonas Schäfer May 29 '15 at 14:21
  • **L1** and **L2** are dominant on the ARM. The **TLB** (and micro-TLB) are separate structures, related to MMU page table mappings. For sections/super-sections, the load is light. It depends on the OS mappings: do they glob entries together, or is everything a 4k page? In any case, the TLB on the ARM is not significant compared to L1/L2 for most workloads. – artless noise May 29 '15 at 14:22
  • He is *flushing* the TLB by using `mmap()` to keep allocating new MMU tables. There are cachebench and membench, etc., that will give better measurements. This code is not a good benchmark suite. – artless noise May 29 '15 at 14:22
  • Oh, I didn't know that mmap() implies that the TLB is flushed. Indeed, you're right that this code is not a good benchmark suite; I was just trying to do things from scratch so I could (maybe) get a better understanding instead of directly using existing tools. – D4l3k May 29 '15 at 14:34
  • It may also be informative to simply use perf to count TLB misses vs. cache misses, etc. for various bits of code. – Notlikethat May 29 '15 at 17:28
  • There is nothing wrong with trying to re-invent things. However, you can save everyone's time by looking at existing solutions. Just run membench and see if your benchmark gives similar or divergent results. Also, you could tag with *linux* or a relevant OS tag so we could make suggestions like perf, if it exists for your OS. – artless noise May 31 '15 at 19:52
  • By membench you mean this? http://bebop.cs.berkeley.edu/notes/matmul2002/membench/membench.c http://www.cs.berkeley.edu/~demmel/cs267_Spr99/Assignments/membench/membench.c – D4l3k Jun 01 '15 at 12:36
  • You could have posted your *membench* results/edits as an answer. I agree that the 'bump' in the graph is due to TLB effects. I think that Drepper's paper is for the x86 with a PIPT cache. The older ARM is VIVT and there is no need for a TLB on a cache hit. (Cortex-A15 is PIPT, I think and all ARMv8 possibly). [Someone else](http://stackoverflow.com/questions/16765389/is-it-true-that-modern-os-may-skip-copy-when-realloc-is-called) was also concerned about TLB on an ARM. – artless noise Oct 11 '15 at 14:55
  • This code also never frees entries before the next outer loop execution. – stark Feb 28 '16 at 20:39
  • A question: the graphs from the "paper" are from a Cray and a DEC Alpha. Are you expecting to see some kind of correlation between those results and the processor in your cell phone? – Bing Bang Jun 15 '16 at 21:12
  • Nope. I just wanted to visualize and understand the behavior of the processor on the mobile device, without using an automated tool such as for instance: perf https://perf.wiki.kernel.org/index.php/Main_Page. From CPUs (http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/CSD-92-684.pdf) to GPUs (http://arxiv.org/pdf/1509.02308.pdf), different techniques exist to highlight the effect of the memory hierarchies. – D4l3k Jun 16 '16 at 09:12
  • This line from your code looks very weird: `if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);`. Without braces around the block of code, this means you always exit. – Sjlver Jun 21 '16 at 17:00
  • `The main TLB is located in L2`. This is a common misconception. An L2TLB is a second level for the L1TLB, just like L2 cache is a second level for the L1D and L1I caches. The L2TLB doesn't use entries in or care about the contents of the "regular" L2 cache. The linked description sounds like how [Intel implements multi-level TLBs on high-power x86 CPUs like Haswell](http://www.realworldtech.com/haswell-cpu/5/), with small/fast L1I and L1D TLBs, and a larger common pool that's not fully associative. The diagram in Kanter's article should be useful. – Peter Cordes Aug 23 '16 at 06:28
  • **Does Linux use [transparent hugepages](https://www.kernel.org/doc/Documentation/vm/transhuge.txt) on the system you tested on**? The doc you linked shows that the TLBs support 1MB and 16MB hugepages (which ARM calls "(super) sections"), so the entire block could be mapped by a few TLB entries (leading to no TLB misses after an initial warm-up period of soft page misses as the kernel wires each page and then notices that it could use a hugepage.). Look for the `AnonHugePages:` entry in `/proc/meminfo` being non-zero. – Peter Cordes Aug 23 '16 at 07:35
  • The C standard does not require calloc to give aligned memory and I don't see where there is any care taken to align on 32-byte boundaries. Are you sure you are aligning properly? http://stackoverflow.com/questions/23092621/why-is-there-no-aligned-calloc-in-c11 – stark Nov 19 '16 at 15:33

1 Answer


tl;dr -> Provide a proper MVCE.

This answer should be a comment, but it is too big to be posted as a comment, so I am posting it as an answer instead:

  1. I had to fix a bunch of syntax errors (missing semicolons) and declare undefined variables.

  2. After fixing all those problems, the code did NOTHING (the program quit even before executing the first mmap). My tip: use curly brackets all the time. Here are your first and second errors, both caused by NOT doing so:


// after calloc:
if(entries == NULL) perror("calloc failed"); exit(1);
// after mmap
if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);

Both lines just terminate your program regardless of the condition.
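With braces (the only change), the checks behave as intended:

// after calloc:
if (entries == NULL) {
    perror("calloc failed");
    exit(1);
}
// after mmap
if (entries[i] == MAP_FAILED) {
    perror("mmap failed");
    exit(1);
}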

  3. Here you have an endless loop (reformatted, curly brackets added, but no other change):


//Randomly access and init with random values
n = -1;
i = 0;
while (++n < NB_ENTRIES -1) {
    //init with random value
    entries[i]->elmt = rand() % NB_ENTRIES;

    //loop till we reach the last element
    while (entries[entries[i]->elmt]->elmt != -1) {
        entries[i]->elmt++;
        if (entries[i]->elmt == NB_ENTRIES) {
            entries[i]->elmt = 0;
        }
    }

    i = entries[i]->elmt;
}

The first iteration starts by setting entries[0]->elmt to some random value, then the inner loop increments that value until it points at LAST_ELEMENT (the entry holding the -1 marker). Then i is set to that value (i.e. LAST_ELEMENT), and the second iteration overwrites the end marker -1 with some other random value. From then on, the inner loop keeps incrementing mod NB_ENTRIES forever, because no element holds -1 anymore, until you hit CTRL+C.
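For what it's worth, the usual way to get a terminating random pointer-chase is to precompute a random visit order and chain the entries along it. A sketch (not the OP's code; `build_random_chain` is my name for it):

#include <stdlib.h>

// Sketch: shuffle the visit order with Fisher-Yates, keeping index 0
// first because the timed walk starts at j = 0; then link each entry
// to its successor and mark the last one with -1. The walk visits
// every entry exactly once and then terminates.
static void build_random_chain(entry **entries, int nb_entries)
{
    int i, j, tmp;
    int *order = malloc(nb_entries * sizeof *order);
    for (i = 0; i < nb_entries; i++) {
        order[i] = i;
    }
    for (i = nb_entries - 1; i > 1; i--) {
        j = 1 + rand() % i;                       // j in [1, i]; position 0 stays put
        tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (i = 0; i < nb_entries - 1; i++) {
        entries[order[i]]->elmt = order[i + 1];   // point to the next entry in order
    }
    entries[order[nb_entries - 1]]->elmt = -1;    // end marker stops the walk
    free(order);
}

With this initialization, the timed loop `j = 0; while (j != -1) j = entries[j]->elmt;` performs exactly NB_ENTRIES dependent loads per pass and then stops.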

Conclusion

If you want help, then post a Minimal, Complete, and Verifiable example and not something else.
