
My question is about understanding the Linux perf tool metrics. I did an optimisation related to prefetching/cache misses in my code, which is now faster. However, perf does not show me that (or, more likely, I do not understand what perf shows me).

To take it back to where it all began: I did an investigation in order to speed up random memory accesses using prefetching.

Here is what my program does:

  1. It uses two int buffers of the same size
  2. It reads one-by-one all the values of the first buffer
    • each value is a random index into the second buffer
  3. It reads the value at that index in the second buffer
  4. It sums all the values taken from the second buffer
  5. It repeats the previous steps for bigger and bigger prefetch steps
  6. At the end, it prints the number of voluntary and involuntary CPU context switches

After my last round of tuning, my code is the following:

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#include <math.h>
#include <sched.h>

#define BUFFER_SIZE ((unsigned long) 4096 * 50000)
#define PADDING     256

unsigned int randomUint()
{
  /* rand() returns an int in [0, RAND_MAX]; since RAND_MAX < UINT_MAX,
     the modulo is effectively a no-op and the value passes through */
  unsigned int value = rand() % UINT_MAX;
  return value;
}


unsigned int * createValueBuffer()
{
  unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
  for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
  {
    valueBuffer[i] = randomUint();
  }

  return (valueBuffer);
}


unsigned int * createIndexBuffer()
{
  /* PADDING extra entries are allocated AND initialized, so that the
     prefetch of valueBuffer[indexBuffer[i + prefetchStep]] in computeSum()
     never reads an uninitialized index near the end of the loop */
  unsigned int * indexBuffer = (unsigned int *) malloc((BUFFER_SIZE + PADDING) * sizeof(unsigned int));
  for (unsigned long i = 0 ; i < BUFFER_SIZE + PADDING ; i++)
  {
    indexBuffer[i] = rand() % BUFFER_SIZE;
  }

  return (indexBuffer);
}


double computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer, unsigned short prefetchStep)
{
  double sum = 0;

  for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
  {
    __builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
    unsigned int index = indexBuffer[i];
    unsigned int value = valueBuffer[index];
    double s = sin(value);
    sum += s;
  }

  return (sum);
}


unsigned int computeTimeInMicroSeconds(unsigned short prefetchStep)
{
  unsigned int * valueBuffer = createValueBuffer();
  unsigned int * indexBuffer = createIndexBuffer();

  struct timeval startTime, endTime;
  gettimeofday(&startTime, NULL);

  double sum = computeSum(indexBuffer, valueBuffer, prefetchStep);

  gettimeofday(&endTime, NULL);

  printf("prefetchStep = %d, Sum = %f - ", prefetchStep, sum);
  free(indexBuffer);
  free(valueBuffer);

  return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);

}


void testWithPrefetchStep(unsigned short prefetchStep)
{
    unsigned int timeInMicroSeconds = computeTimeInMicroSeconds(prefetchStep);
    printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));

}


void iterateOnPrefetchSteps()
{
  printf("sizeof buffers = %luMB\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
  for (unsigned short prefetchStep = 0 ; prefetchStep < 250 ; prefetchStep++)
  {
    testWithPrefetchStep(prefetchStep);
  }
}

void setCpuAffinity(int cpuId)
{
  /* pin the process to a single CPU so every run uses the same core and caches */
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(cpuId, &mask);
  if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
  {
    perror("sched_setaffinity");
  }
}

int main(int argc, char ** argv)
{
  setCpuAffinity(7);

  if (argc == 2)
  {
    testWithPrefetchStep(atoi(argv[1]));
  }
  else
  {
    iterateOnPrefetchSteps();
  }

}

At the end of my previous Stack Overflow question, I thought I had all the elements: in order to avoid cache misses I made my code prefetch data (using __builtin_prefetch) and my program became faster. Everything looked as normal as possible.
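
For reference, GCC documents the builtin as __builtin_prefetch(addr, rw, locality), where the two optional trailing arguments describe the intended access. The call in computeSum() (repeated below with explanatory comments) asks for a read prefetch with no temporal locality:

/* __builtin_prefetch(addr, rw, locality)
     rw       : 0 = prefetch for a read, 1 = prefetch for a write
     locality : 0..3, where 0 means no temporal locality (the line need
                not be kept in cache after use) and 3 means keep it in
                all cache levels */
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);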

However, I wanted to study it using the Linux perf tool, so I ran a comparison between two executions of my program:

  • ./TestPrefetch 0: this way, the prefetch is inefficient because it targets the very data that is read immediately afterwards (by the time the data is accessed, it cannot yet have been loaded into the CPU cache). Run duration: 21.346 seconds
  • ./TestPrefetch 1: here the prefetch is far more efficient because the data is fetched one loop iteration before it is read. Run duration: 12.624 seconds

The perf outputs are the following:

$ gcc -O3 TestPrefetch.c -o TestPrefetch -lm  && for cpt in 0 1; do echo ; echo "### Step=$cpt" ; sudo perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./TestPrefetch $cpt; done  

### Step=0
prefetchStep = 0, Sum = -1107.523504 - Time: 21346278 micro-seconds = 21.346 seconds

 Performance counter stats for './TestPrefetch 0':

      24387,010283      task-clock (msec)         #    1,000 CPUs utilized          
    97 274 163 155      cycles                    #    3,989 GHz                    
    59 183 107 508      instructions              #    0,61  insn per cycle         
       425 300 823      cache-references          #   17,440 M/sec                  
       249 261 530      cache-misses              #   58,608 % of all cache refs    

      24,387790203 seconds time elapsed


### Step=1
prefetchStep = 1, Sum = -1107.523504 - Time: 12623665 micro-seconds = 12.624 seconds

 Performance counter stats for './TestPrefetch 1':

      15662,864719      task-clock (msec)         #    1,000 CPUs utilized          
    62 159 134 934      cycles                    #    3,969 GHz                    
    59 167 595 107      instructions              #    0,95  insn per cycle         
       484 882 084      cache-references          #   30,957 M/sec                  
       321 873 952      cache-misses              #   66,382 % of all cache refs    

      15,663437848 seconds time elapsed

Here, I have difficulty understanding why my program is doing better:

  • The number of cache-misses is almost the same (I even have a few more): I can't understand why, and if that is so, why am I faster?
  • What are the cache-references?
  • What are task-clock and cycles? Do they include the time spent waiting for data accesses in case of a cache miss?
Philippe MESMEUR

3 Answers


I wouldn't trust the perf summary, as it's not very clear what each name represents and which perf counters they are programmed to follow. The default settings have also been known to count the wrong things (see https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/557604).

What could be happening here is that your cache-miss counter may also count the prefetch instructions (which may look like loads to the machine, especially as you descend the cache hierarchy). In that case, having more cache references (lookups) makes sense, and you would expect those requests to be misses (the whole point of a prefetch is to miss...).

Instead of relying on some ambiguous counter, find out the counter IDs and masks for your specific machine that represent demand read lookups and misses, and see if they improved.
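
To illustrate, here is a minimal sketch of programming such a counter directly through the perf_event_open(2) syscall instead of the perf CLI. The raw value 0x01d1 below is a placeholder I made up for the event|umask encoding; the real codes for demand loads are machine-specific and must come from your CPU's PMU documentation or perf list:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* thin wrapper: glibc provides no perf_event_open() function */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
  return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_RAW;
  attr.config = 0x01d1;       /* hypothetical event|umask value; take the
                                 real encoding from your CPU's PMU manual */
  attr.disabled = 1;
  attr.exclude_kernel = 1;

  int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
  if (fd < 0) { perror("perf_event_open"); return 1; }

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  /* ... run the code to measure here ... */
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

  long long count = 0;
  read(fd, &count, sizeof(count));
  printf("raw event count: %lld\n", count);
  close(fd);
  return 0;
}

Reading the counter around just the computeSum() call, instead of the whole process, would also exclude the buffer-initialization phase that perf stat measures above.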

Edit: looking at your numbers again, I see an increase of ~60M accesses, but ~73M misses. It's possible that there are more misses due to cache thrashing done by the prefetches.

Leeor

I can't understand why, and if that is so, why am I faster?

Because you run more instructions per unit of time. The old run:

0,61  insn per cycle 

and the new one:

0,95  insn per cycle

what are the cache-references?

They count how many times the cache was asked whether it contains the data you were loading/storing.

What are task-clock and cycles? Do they include the time spent waiting for data accesses in case of a cache miss?

Yes. But note that on today's processors, the core does not simply sit and wait for any of this. Instructions are executed out of order, data is usually prefetched, and if the next instruction needs some data that is not ready, other instructions get executed.
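
To make that concrete, here is an illustrative sketch (mine, not code from the question) of the hot loop rewritten with four independent accumulators. None of the four loads in an iteration depends on another, so an out-of-order core can have several of their cache misses in flight at once. It assumes the question's BUFFER_SIZE macro, which is a multiple of 4; the name computeSumUnrolled is hypothetical:

#include <math.h>

/* Four independent dependency chains: the CPU does not have to wait for
   one missing load before starting the next one. */
double computeSumUnrolled(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (unsigned long i = 0 ; i < BUFFER_SIZE ; i += 4)
  {
    s0 += sin(valueBuffer[indexBuffer[i]]);
    s1 += sin(valueBuffer[indexBuffer[i + 1]]);
    s2 += sin(valueBuffer[indexBuffer[i + 2]]);
    s3 += sin(valueBuffer[indexBuffer[i + 3]]);
  }
  return (s0 + s1 + s2 + s3);
}

(Reassociating the floating-point additions this way can change the printed sum in its last digits.)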

Jakuje
  • `insn per cycle` is an average over the whole execution, isn't it? My program is now faster (12.624 seconds against 21.346 seconds), so logically `insn per cycle` is higher. But I still do not understand why, as the `cache-misses` are higher – Philippe MESMEUR Dec 05 '16 at 15:09
  • In fact, I still do not understand why the `cache-misses` are higher – Philippe MESMEUR Dec 05 '16 at 15:10
  • _"The instructions are executed out-of-order"_: what exactly do you mean? _"if the next instruction needs some data that are not ready, other instructions will get executed"_: I do not understand what you mean. Can you please provide an example or good links? Thank you – Philippe MESMEUR Dec 05 '16 at 15:14
  • `cache-misses` are higher, because the prefetch probably counts as a cache miss. It is higher, but you obviously see the results. – Jakuje Dec 05 '16 at 15:17
  • Basically [Intel Hyper Threading](https://en.wikipedia.org/wiki/Hyper-threading) is a kind of OOO execution. – Jakuje Dec 05 '16 at 15:18

I recently made some progress on my perf issues. I discovered a lot of new events, among which some are really interesting.

Regarding the current problem, the following event has to be considered: L1-icache-load-misses

When I monitor my test application with perf under the same conditions as previously, I get the following values for this event:

     1 202 210      L1-icache-load-misses

against

       530 127      L1-icache-load-misses

For the moment, I do not yet understand why the cache-misses event is not impacted by prefetches while L1-icache-load-misses is...
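
For what it's worth, the data-cache counterpart can be requested explicitly through the generic cache-event encoding documented in perf_event_open(2). Here is a sketch (assuming the perf_event_open() wrapper from the earlier answer's example) that asks for L1 data-cache read misses; whether the PMU actually maps this generic event is hardware-dependent:

#include <string.h>
#include <linux/perf_event.h>

/* Generic cache-event encoding from perf_event_open(2):
   config = cache_id | (op_id << 8) | (result_id << 16) */
static void setupL1dReadMisses(struct perf_event_attr * attr)
{
  memset(attr, 0, sizeof(*attr));
  attr->size = sizeof(*attr);
  attr->type = PERF_TYPE_HW_CACHE;
  attr->config = PERF_COUNT_HW_CACHE_L1D
               | (PERF_COUNT_HW_CACHE_OP_READ << 8)
               | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
  attr->disabled = 1;
  attr->exclude_kernel = 1;
}

The same encoding with PERF_COUNT_HW_CACHE_L1I corresponds to the L1-icache-load-misses event shown above, which makes it easy to compare the two side by side.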

Philippe MESMEUR