I have developed a code that gets as input a large 2-D image (up to 64MPixels) and:

  • Applies a filter on each row
  • Transposes the image (used blocking to avoid lots of cache misses)
  • Applies a filter on the columns (now rows) of the image
  • Transposes the filtered image back to carry on with other calculations

Although it doesn't change anything, for the sake of completeness of my question: the filtering applies a discrete wavelet transform and the code is written in C.
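
For reference, the blocked transpose is conceptually along the lines of this simplified sketch (not my actual code; the tile size `BLOCK` is just illustrative):

    #include <stddef.h>

    /* Out-of-place blocked transpose of a rows x cols image.
     * BLOCK is picked so that one source tile plus one destination tile
     * fit comfortably in cache (the value here is illustrative). */
    #define BLOCK 64

    void transpose_blocked(const float *src, float *dst,
                           size_t rows, size_t cols)
    {
        for (size_t ib = 0; ib < rows; ib += BLOCK) {
            for (size_t jb = 0; jb < cols; jb += BLOCK) {
                size_t imax = (ib + BLOCK < rows) ? ib + BLOCK : rows;
                size_t jmax = (jb + BLOCK < cols) ? jb + BLOCK : cols;
                for (size_t i = ib; i < imax; i++)
                    for (size_t j = jb; j < jmax; j++)
                        dst[j * rows + i] = src[i * cols + j];
            }
        }
    }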

My end goal is to make this run as fast as possible. The speedup I have so far is more than 10x, through the use of the blocked matrix transpose, vectorization, multithreading, compiler-friendly code, etc.

Coming to my question: the latest profiling stats for my code (collected with `perf stat -e`) have troubled me.

        76,321,873 cache-references                                            
     8,647,026,694 cycles                    #    0.000 GHz                    
     7,050,257,995 instructions              #    0.82  insns per cycle        
        49,739,417 cache-misses              #   65.171 % of all cache refs    

       0.910437338 seconds time elapsed

The ratio (# of cache-misses)/(# of instructions) is low, at around 0.7%. [Here](http://developerblog.redhat.com/2014/03/10/determining-whether-an-application-has-poor-cache-performance-2/) it is mentioned that this number is a good one to keep in mind when checking for memory efficiency.

On the other hand, the ratio of cache-misses to cache-references is significantly high (65%!), which, as I see it, could indicate that something is going wrong with the execution in terms of cache efficiency.

The detailed stats from `perf stat -d` are:

   2711.191150 task-clock                #    2.978 CPUs utilized          
         1,421 context-switches          #    0.524 K/sec                  
            50 cpu-migrations            #    0.018 K/sec                  
       362,533 page-faults               #    0.134 M/sec                  
 8,518,897,738 cycles                    #    3.142 GHz                     [40.13%]
 6,089,067,266 stalled-cycles-frontend   #   71.48% frontend cycles idle    [39.76%]
 4,419,565,197 stalled-cycles-backend    #   51.88% backend  cycles idle    [39.37%]
 7,095,514,317 instructions              #    0.83  insns per cycle        
                                         #    0.86  stalled cycles per insn [49.66%]
   858,812,708 branches                  #  316.766 M/sec                   [49.77%]
     3,282,725 branch-misses             #    0.38% of all branches         [50.19%]
 1,899,797,603 L1-dcache-loads           #  700.724 M/sec                   [50.66%]
   153,927,756 L1-dcache-load-misses     #    8.10% of all L1-dcache hits   [50.94%]
    45,287,408 LLC-loads                 #   16.704 M/sec                   [40.70%]
    26,011,069 LLC-load-misses           #   57.44% of all LL-cache hits    [40.45%]

   0.910380914 seconds time elapsed

Here the frontend and backend stalled cycles are also high, and the last-level cache seems to suffer from a high miss rate of ~57%.

Which metric is the most appropriate for this scenario? One idea I had is that the workload no longer requires further "touching" of the LL cache after the initial image load (it loads the values once and after that it's done; being an image filtering algorithm, the workload is more CPU-bound than memory-bound).

The machine I'm running this on is a Xeon E5-2680 (20MB of L3 Smart Cache, plus 256KB of L2 cache per core, 8 cores).

  • koukouviou, `perf stat -d` may be inaccurate; it can be useful to rerun it several times with 4-5 events per run, e.g. `perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses`. If your stages can be separated, you may measure every stage to find which one generates the misses. Or you can run `perf record -e cache-misses` to get a profile of the code with the most misses (as was recommended in the ["Here"](http://developerblog.redhat.com/2014/03/10/determining-whether-an-application-has-poor-cache-performance-2/) you linked). – osgx Mar 14 '15 at 07:24
  • @osgx I have done the tests you mention but the post was already lengthy so I kept it as brief as I could. Although the stats change a bit, they remain statistically similar, with a high cache-miss %. I've run perf record/report/annotate and the misses are attributed largely to kernel-kallsyms, but I don't know what to make of that... – koukouviou Mar 14 '15 at 07:58
  • koukouviou, you may rerun your stat with all events limited to user-space, `perf stat -e event1:u,event2:u,...`, and compare with the posted results (by default both user and kernel are measured). kernel-kallsyms means that some function from kernel-space had this event... Some possible issues with unknown/unresolved symbols are listed here: http://www.brendangregg.com/blog/2014-06-22/perf-cpu-sample.html. I can't give you advice about metrics... – osgx Mar 14 '15 at 08:07

1 Answer

The first thing you want to make sure of is that no other compute-intensive process is running on your machine. That's a server CPU, so I thought that could be a problem.

If you use multi-threading in your program and you distribute an equal amount of work between threads, you might be interested in collecting metrics on only one CPU.
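
For example (a minimal sketch, assuming Linux; the CPU number 0 is arbitrary), you could pin the process to one CPU and then count only on that CPU with `perf stat -C 0`:

    /* Minimal sketch: pin the calling process to one CPU (CPU 0 here,
     * the number is arbitrary) so that per-CPU counting such as
     * `perf stat -C 0` covers exactly this work. Linux-specific. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) { /* pid 0 = self */
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }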

I suggest disabling hyper-threading in the optimization phase, as it can lead to confusion when interpreting the profiling metrics (e.g. increased #cycles spent in the back-end). Also, if you distribute work to 3 threads, there is a high chance that 2 threads will share the resources of one core and the 3rd will have an entire core for itself - and it will be faster.

Perf has never been very good at explaining the metrics. Judging by the order of magnitude, the cache references are the L2 misses that hit the LLC. A high LLC miss count compared with LLC references is not always a bad thing if the number of LLC references / #instructions is low. In your case you have 0.018, so that means most of your data is served from L2. The high LLC miss ratio means that you still need to get data from RAM and write it back.

Regarding #cycles FE and BE bound, I'm a bit concerned about the values because they don't add up to 100% or to the total number of cycles: you have 8G cycles in total, but 6G cycles stalled in the FE and 4G cycles stalled in the BE. That does not seem right.

If the FE stall count is high, that means you have misses in the instruction cache or bad branch speculation. If the BE stall count is high, that means you are waiting for data.

Anyway, regarding your question: the most relevant metric to assess the performance of your code is instructions per cycle (IPC). Your CPU can execute up to 4 instructions per cycle, but you only execute 0.8. That means resources are underutilized, except for the case where you have many vector instructions. After IPC you need to check branch misses and L1 misses (data and code), because those generate most of the penalties.
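
To illustrate the vector-instruction caveat: one SSE instruction does the work of four scalar ones, so a heavily vectorized loop retires fewer instructions (and thus reports a lower IPC) for the same amount of useful work. A rough sketch, not your actual filter (assumes SSE and that `n` is a multiple of 4):

    #include <stddef.h>
    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Scalar version: roughly one multiply instruction per element. */
    void scale_scalar(float *x, float k, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            x[i] *= k;
    }

    /* SSE version: one multiply instruction per four elements, so the
     * retired instruction count (and hence IPC) drops even though the
     * same useful work is done. Assumes n is a multiple of 4. */
    void scale_sse(float *x, float k, size_t n)
    {
        __m128 vk = _mm_set1_ps(k);
        for (size_t i = 0; i < n; i += 4) {
            __m128 v = _mm_loadu_ps(x + i);
            _mm_storeu_ps(x + i, _mm_mul_ps(v, vk));
        }
    }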

A final suggestion: you may be interested in trying Intel's VTune Amplifier. It gives a much better explanation of the metrics and points you to potential problems in your code.

  • VAndrei, BE and FE stalls will never sum with instructions to 100% of cycles, because most Intel CPUs are superpipelined. Many instructions may be in flight in the pipeline, and some will generate stalls, but some will be executed. 71% FE stalls is high (do you have much code that is complex to decode?) and 51% BE stalls is rather high too - so there are several problems... koukouviou, you may also try `toplev.py` from [andikleen/pmu-tools (github)](https://github.com/andikleen/pmu-tools) as a free palliative to VTune. – osgx Mar 14 '15 at 21:47
  • @osgx I'm not so sure about that. FE bound means that the reservation unit is not fed with instructions, so no execution ports have work to do. That is due to code cache misses, complex instructions and bad speculation. If BE is high, that means the instructions are keeping the execution ports busy but they are waiting for resources (data or free registers) or they have high latencies (e.g. transcendentals). There should be another metric: #cycles retiring. If you look at VTune you will see that FE+BE+Retiring always add up to 100%. – VAndrei Mar 14 '15 at 21:57
  • VTune may use different counters for FE/BE. Perf uses (according to Manuel Selva's answer http://stackoverflow.com/a/22166684 and [sources for SandyBridge](http://lxr.free-electrons.com/source/arch/x86/kernel/cpu/perf_event_intel.c?v=3.16#L2478) ) `UOPS_ISSUED.ANY,umask=0x01,inv,cmask=0x01` for FE stalls and `UOPS_DISPATCHED.THREAD,umask=0x01,inv,cmask=0x01` for BE. – osgx Mar 14 '15 at 22:06
  • VAndrei, thanks for all the insights. I have used VTune as well as perf during my optimization "journey" on this code. As for bad branch predictions, the number is very low (acceptably low at least). The code was compiled with pgo-gen/pgo-use and the branch prediction was slightly improved. The CPU is a server model but I am the only one with access to the server, and nothing other than the bare minimum of processes is running on it. – koukouviou Mar 14 '15 at 23:33
  • IPC is low but, as you mentioned, I do have a large number of vector instructions (the filtering of the image pixels is done on the SIMD pipe of the CPU - that was a design decision to speed things up). I am using CilkPlus for the multithreading (no restriction on the number of threads) so there could be up to 16 threads. @osgx Thanks for the `toplev.py` tool, I was not aware of that; I'll give it a shot – koukouviou Mar 14 '15 at 23:35
  • "no other computational process". Just one question here. Does perf count cycles for only a process alone (task-clock), or does it include other processes? What is the purpose of stopping other computational processes? Is it to keep the caches and TLB hot? – Johannes Schaub - litb Aug 31 '17 at 12:57
  • @JohannesSchaub-litb perf can collect statistics system wide or per process. Reducing number of compute intensive processes usually increases accuracy of the collection because perf does sampling. – VAndrei Sep 01 '17 at 04:54