
I am trying to investigate the performance of my program, where cache misses are a huge bottleneck. For testing purposes, before integrating PAPI into the target application, I wanted to verify how things work, which is why I put together the sample program below.

My intention is to use PAPI to monitor the cache misses of a separate thread. I am trying to use PAPI_attach to bind my event set to that thread's specific thread ID; however, the cache misses I measure are still the same (or at least VERY similar) as when I do not use PAPI_attach at all.

Another experiment I did to verify my concerns was to start the Firefox browser during a run of this very simple program. This led to an increased number of measured cache misses, so obviously something is strange about the PAPI_attach function or about how I am using it.

I am using the code below for my thread worker:

void * Slave(void * args)
{

   int rc = 0;
   int tmp, i, j;
   /*must be initialized to PAPI_NULL before calling PAPI_create_event*/
   int EventSet = PAPI_NULL;
   long long values[NUM_EVENTS];
   /*This is where we store the values we read from the eventset */

   /* We use number to keep track of the number of events in the EventSet */ 
   int retval, number;


   pid_t tid;
   tid = syscall(SYS_gettid);

   char errstring[PAPI_MAX_STR_LEN];

   /* get the number of events in the event set */
   number = 0;

   printf("My pid is: %d\n", tid);

   if ( (retval=PAPI_register_thread())!= PAPI_OK )
       ERROR_RETURN(retval);

   if ( (retval = PAPI_create_eventset(&EventSet)) != PAPI_OK)
      ERROR_RETURN(retval);

   /* Add total L1 cache misses to the EventSet */
   if ( (retval = PAPI_add_event(EventSet, PAPI_L1_TCM)) != PAPI_OK)
      ERROR_RETURN(retval);

   /* Add total L2 cache misses to the EventSet */
   if ( (retval = PAPI_add_event(EventSet, PAPI_L2_TCM)) != PAPI_OK)
      ERROR_RETURN(retval);

   /* Add total L3 cache misses to the EventSet */
   if ( (retval = PAPI_add_event(EventSet, PAPI_L3_TCM)) != PAPI_OK)
      ERROR_RETURN(retval);

   number = 0;
   if ( (retval = PAPI_list_events(EventSet, NULL, &number)) != PAPI_OK)
      ERROR_RETURN(retval);

   printf("There are %d events in the event set\n", (unsigned int)number);

   if ((retval = PAPI_attach(EventSet, tid)) != PAPI_OK)
      ERROR_RETURN(retval);
   /* Start counting */

   if ( (retval = PAPI_start(EventSet)) != PAPI_OK)
      ERROR_RETURN(retval);

   /* the workload to be measured goes here */

   tmp=0;
   for (i = 0; i < 200000000; i++)
   {
      tmp = i + tmp;
   }

   if ( (retval=PAPI_read(EventSet, values)) != PAPI_OK)
      ERROR_RETURN(retval);

   printf("L1 misses %lld \n", values[0] );
   printf("L2 misses %lld \n",values[1]);
   printf("L3 misses %lld \n",values[2]);

   if ( (retval = PAPI_stop(EventSet, values)) != PAPI_OK)
      ERROR_RETURN(retval);



   /* free the resources used by PAPI */
   PAPI_shutdown();

   return NULL;
}

And the following code for spawning the thread:

int main()
{
   pthread_t master;
   pthread_t slave1;
   pthread_attr_t attr;
   int rc = 0;

   int retval;
   unsigned long pid;
   char errstring[PAPI_MAX_STR_LEN];

   pthread_attr_init(&attr);
   pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

   if((retval = PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT )
          ERROR_RETURN(retval); 

   if ((retval = PAPI_thread_init((unsigned long (*)(void)) pthread_self)) != PAPI_OK)
     ERROR_RETURN(retval);

   /* PAPI_thread_id() is only meaningful after PAPI_library_init and
      PAPI_thread_init have been called */
   pid = PAPI_thread_id();

   rc = pthread_create(&slave1, &attr, Slave, NULL);
   pthread_join(slave1, NULL);
   exit(0);
}

The frustrating part is that I get no errors, which suggests that everything is working.
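
For comparison, here is roughly the pattern I expected to need: the event set is created, attached, and read by the main thread instead of by the worker itself. This is only a sketch based on my assumptions about how PAPI_attach is meant to be used, not code from my application; the barrier handshake and names such as g_worker_tid are invented for illustration.

/* Sketch: the main thread attaches an event set to the worker's Linux tid
 * and reads the counters itself. The barrier handshake just makes sure the
 * tid is published and counting has started before the workload begins. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <papi.h>

static pid_t g_worker_tid;            /* Linux tid of the thread we attach to */
static pthread_barrier_t g_barrier;   /* 2-party handshake between main/worker */

static void *worker(void *arg)
{
   volatile long tmp = 0;
   long i;
   (void)arg;

   g_worker_tid = syscall(SYS_gettid);
   pthread_barrier_wait(&g_barrier);  /* 1: tid is published          */
   pthread_barrier_wait(&g_barrier);  /* 2: main has started counting */

   for (i = 0; i < 200000000L; i++)   /* same dummy workload as above */
      tmp += i;

   return NULL;
}

int main(void)
{
   int EventSet = PAPI_NULL;
   long long values[3];
   pthread_t t;

   if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
      exit(1);
   if (PAPI_thread_init((unsigned long (*)(void)) pthread_self) != PAPI_OK)
      exit(1);

   pthread_barrier_init(&g_barrier, NULL, 2);
   pthread_create(&t, NULL, worker, NULL);
   pthread_barrier_wait(&g_barrier);  /* 1: g_worker_tid is now valid */

   if (PAPI_create_eventset(&EventSet) != PAPI_OK)
      exit(1);
   /* Adding events binds the event set to a component, which must
    * happen before PAPI_attach. */
   if (PAPI_add_event(EventSet, PAPI_L1_TCM) != PAPI_OK ||
       PAPI_add_event(EventSet, PAPI_L2_TCM) != PAPI_OK ||
       PAPI_add_event(EventSet, PAPI_L3_TCM) != PAPI_OK)
      exit(1);

   /* Only events of the attached tid should be counted from here on. */
   if (PAPI_attach(EventSet, g_worker_tid) != PAPI_OK)
      exit(1);
   if (PAPI_start(EventSet) != PAPI_OK)
      exit(1);

   pthread_barrier_wait(&g_barrier);  /* 2: release the worker */
   pthread_join(t, NULL);

   if (PAPI_stop(EventSet, values) != PAPI_OK)
      exit(1);

   printf("L1 misses %lld\nL2 misses %lld\nL3 misses %lld\n",
          values[0], values[1], values[2]);

   PAPI_shutdown();
   return 0;
}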

  • I should probably also add the system specifics with some results: L1 D&I cache: 64 KB, L2 cache: 512 KB, shared L3 cache: 4 MB. Results: L1 misses 4988, L2 misses 10033, L3 misses 6737. This adds further suspicion to my results: why would I get more cache misses on the bigger (local) L2 cache than on the L1 cache? – Jakob Danielsson Nov 15 '17 at 10:45
  • Jakob, can you count L1/L2 accesses too? What is your CPU model (microarchitecture)? Many CPUs have built-in auto-prefetchers; for example, Intel has two at the L1 level and two more at the L2 level: when a program makes two accesses with some stride between them, the auto-prefetcher hardware detects it and issues further requests with the same stride added to the address of your previous request (as long as the prefetched address is still in the same physical page). Some hardware prefetchers are very aggressive and can issue many additional requests. Some can be disabled; check the Intel link from https://stackoverflow.com/a/41917209 – osgx Nov 25 '17 at 18:39
  • Are the numbers from the code you posted or from your application? The code you posted (`tmp = i + tmp` in a loop) leads to no cache misses and can even be optimized away completely. The numbers you report are tiny and may just be artifacts. Can you reproduce this with code that actually causes cache misses by itself and verifiably runs for a reasonable amount of time? – Zulan Nov 28 '17 at 21:28
  • @Zulan, I also tested the implementation compiled with -O0, using volatile for my tmp variable, so it should not be optimized away – Jakob Danielsson Nov 29 '17 at 09:28
  • @osgx Hmm, I'm using a quad-core Intel 3570, so yes, it has prefetchers. What confused me most is that the L1 cache seems to have fewer misses than L2; from my understanding caches are hierarchical, so the L1 cache will always miss when there is a miss in the L2 cache? However, the results may be, as you say, an effect of the prefetcher. When testing the PAPI library with some cache-intensive code, i.e. image processing, the counters produce a more reasonable result, with the fewest misses in L1, then L2, then L3. – Jakob Danielsson Nov 29 '17 at 09:36
  • Maybe the code provided in this question isn't sufficient to measure cache misses accurately? (See the strided-access sketch below these comments.) – Jakob Danielsson Nov 29 '17 at 09:39
  • Let me try again. Your core code has no inherent cache misses. What you measure is probably just noise. Given that, what is your actual question? – Zulan Nov 29 '17 at 23:04
  • Exactly, so I bind the thread id to the event set using attach, to measure only the cache misses associated with that tid. If what you say is true, I shouldn't register noise at all, since I monitor the cache misses on that thread only. My actual question was whether I can trust that PAPI_attach measures cache misses for only one specific tid. – Jakob Danielsson Nov 30 '17 at 00:53
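
Follow-up to the comments above: the accumulation loop in my sample barely touches memory, so a strided walk over a buffer larger than the L3 cache should be a better test workload. This is only a sketch; BUF_SIZE, STRIDE, and stride_workload are arbitrary names and values picked for illustration, not part of my application.

/* A strided walk over a buffer much larger than the 4 MB L3 cache.
 * Each access lands on a new cache line, so the loop is expected to
 * generate a large, repeatable number of L1/L2/L3 misses.
 * BUF_SIZE and STRIDE are arbitrary illustration values. */
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE  (64u * 1024 * 1024)  /* 64 MB, much larger than the L3 */
#define STRIDE    64                   /* roughly one cache line per access */

static long stride_workload(void)
{
   char *buf = malloc(BUF_SIZE);
   long sum = 0;
   size_t i;
   int pass;

   if (buf == NULL)
      return 0;

   memset(buf, 1, BUF_SIZE);           /* commit real physical pages first */

   for (pass = 0; pass < 16; pass++)            /* several passes for a longer run */
      for (i = 0; i < BUF_SIZE; i += STRIDE)    /* touch a new cache line each time */
         sum += buf[i];

   free(buf);
   return sum;                         /* use the result so the loop isn't removed */
}

Calling stride_workload() in place of the accumulation loop between PAPI_start and PAPI_read, and printing the returned sum so the compiler cannot drop the loop, should give miss counts well above any noise floor.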

0 Answers