
I am interested in finding the number of memory accesses of a particular function in my program, which is written in C++ and runs on Linux. To count memory accesses I am using Valgrind's Cachegrind tool, with the following command:

cg_annotate --show=Dr,Dw cachegrind.out.25329  |tee log.txt

The output for that function looks like this:

  379,010,475   697,368,671  ???:CheckInput(std::string)

Now, basically, I have three functions and I want to compare them on the basis of their number of memory accesses. I want to make clear: is this the right way to do the comparison? Do I need to take the average number of memory accesses over several runs, or is a single reading of total memory accesses per function enough? Secondly, can I conclude that the function with the smaller number of memory accesses (memory reads + memory writes) is the faster one?

Xara
  • Look into Intel's Pin tool; it lets you add dynamic instrumentation to your program, so you can monitor all memory accesses (as callbacks into your instrumentation code). http://software.intel.com/en-us/articles/pintool – amdn Mar 14 '14 at 20:48
  • Have a look at kcachegrind, it will make your work easier. – jaho Mar 15 '14 at 00:38
  • @amdn I made use of pinatrace a few months back, but that generates a huge file with all the reads and writes of the whole program, whereas I want to find them for one particular function in my program. Can you please provide any help in this regard? – Xara Mar 15 '14 at 05:51
  • @Zara you will have to modify pinatrace (https://github.com/jingpu/pintools/blob/master/source/tools/ManualExamples/pinatrace.cpp) so that before PIN_StartProgram() you get function names with PIN_InitSymbols(), then hook only the function you want with RTN_AddInstrumentFunction(). I've not done this myself, but from reading the documentation that is my understanding - http://software.intel.com/sites/landingpage/pintool/docs/49306/Pin/html/ – amdn Mar 15 '14 at 06:27

1 Answer


Looking at Cachegrind output is not a good way to determine the performance of functions in isolation. Such tests are a poor indication of how a function will perform in real usage, where things like branch prediction and cache hit rates dominate.

Cachegrind can often tell you WHY a given implementation is slow in practice, but you need to run it on a representative execution of your whole binary, with representative data.

A couple of other things to note:

Raw Dr/Dw counts are not a good proxy for performance, since they tell you nothing about your cache hit rate. All else being equal, a function that reads 1000 values from L1 cache is going to be faster than a function whose single read of memory results in a page fault and forces the page to be loaded from disk.
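To make that concrete, here is a minimal sketch of my own (not from the question): two functions with identical read counts, so Cachegrind's Dr column looks the same for both, yet the strided walk misses in cache far more often and runs slower on large arrays:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Both functions read every element exactly once, so their Dr counts match.
// The sequential walk has near-perfect spatial locality (cache-friendly);
// the strided walk touches a new cache line on almost every access.
std::uint64_t sum_sequential(const std::vector<std::uint32_t>& v) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];                       // consecutive addresses: cache hits
    return s;
}

std::uint64_t sum_strided(const std::vector<std::uint32_t>& v,
                          std::size_t stride) {
    std::uint64_t s = 0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < v.size(); i += stride)
            s += v[i];                   // same reads, scattered addresses
    return s;
}
```

Both return the same sum; only the access pattern, and therefore the hit rate, differs.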

You're not going to see anything about failed branch predictions. All else being equal, a function that performs poorly on branch prediction is going to be much slower than an equivalent function structured to perform well on branch prediction.
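As a hypothetical illustration: the loop below executes the same instructions and the same memory accesses regardless of input order, but on sorted data the branch is almost perfectly predictable, while on shuffled data it mispredicts roughly half the time:

```cpp
#include <cstdint>
#include <vector>

// Same instruction count, same memory-access count, different speed:
// the `if` below is trivially predicted when `v` is sorted, and
// essentially random (frequent mispredictions) when `v` is shuffled.
std::uint64_t sum_over_threshold(const std::vector<int>& v, int threshold) {
    std::uint64_t s = 0;
    for (int x : v)
        if (x >= threshold)              // predictability depends on input order
            s += static_cast<std::uint64_t>(x);
    return s;
}
```

A Dr/Dw comparison would rate the sorted and shuffled runs identically.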

edit ----------------------------------------------

Since you didn't give any details, I don't know what you're doing. But let's say you wrote tests like these:

for (int i = 0; i < 100000; ++i) {
    func1("test string");
}
for (int i = 0; i < 100000; ++i) {
    func2("test string");
}

This is not representative of how an actual program would use the function (hence what I said about using representative data), and because of that the test is pretty worthless: it's a "microbenchmark". Everything fits in the cache after the first pass through the function, and branch prediction will be far better than in real-world usage since you're always using the same input.
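A sketch of a slightly less worthless harness (CheckInput here is a stand-in stub and make_workload a name I've invented, not the asker's real code): feed the function a varied workload so the caches and branch predictor see something closer to real inputs:

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Placeholder for the function under test -- NOT the asker's implementation.
bool CheckInput(const std::string& s) {
    return !s.empty() && s.front() != '#';
}

// Generate n random lowercase strings of varying length, so consecutive
// calls to CheckInput do not all take the same path through the code.
std::vector<std::string> make_workload(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> len(1, 64);
    std::uniform_int_distribution<int> ch('a', 'z');
    std::vector<std::string> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::string s(static_cast<std::size_t>(len(rng)), ' ');
        for (char& c : s)
            c = static_cast<char>(ch(rng));
        out.push_back(std::move(s));
    }
    return out;
}
```

Generating the workload up front also keeps the string construction cost out of the measured loop.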

To write proper performance tests ask yourself "how would I use this function in my application", and write your tests to mimic that. Better yet, actually profile the function in your application. Don't have an application? Then you're optimizing way too soon (unless your interest is purely academic).

For the reasons I pointed out in my original answer, a raw count of memory accesses is not going to tell you the performance of a function, partly because not all memory accesses are created equal. Depending on WHERE the memory is being read from, access times differ by orders of magnitude (see: Approximate cost to access various caches and main memory?). Beyond that, there is much more going on than just memory accesses; you could write a function that performs billions of operations entirely on what's stored in registers.
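For instance, here is a sketch of a loop whose body touches no data memory at all: its cost is pure register arithmetic, so Dr/Dw counts say essentially nothing about its runtime:

```cpp
#include <cstdint>

// Each iteration is one multiply and one add on a value the compiler keeps
// in a register (a linear congruential generator step). With n in the
// billions this dominates runtime, yet the loop body performs no data
// reads or writes for Cachegrind to count.
std::uint64_t mixer(std::uint64_t seed, std::uint64_t n) {
    std::uint64_t x = seed;
    for (std::uint64_t i = 0; i < n; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}
```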

You seem very focused on memory accesses... Have you tried reading up on profiling in general?

patros
  • Can you please explain "you need to run it on a representative execution of your whole binary/data" a bit? Secondly, what else can I do to compare the three functions on the basis of memory accesses? – Xara Mar 14 '14 at 20:32
  • Where did he state he's profiling these functions in isolation and not in his actual application? – jaho Mar 15 '14 at 00:34
  • That was the impression I got from their post and follow up question. At the very least they wanted to know why it made a difference. Where did OP claim they were a 'he'? – patros Mar 15 '14 at 02:31
  • @patros Let me clear up my scenario again. I have three application codes. Each application has a function named CheckInput(). This CheckInput function performs a different algorithm in each application. So, I was interested in which app's CheckInput function will do its work with fewer memory accesses. – Xara Mar 15 '14 at 04:36
  • @patros I got your point from the second-to-last paragraph that it matters a lot where the data is accessed from. Can you please mention what other metrics I can use to check the performance of the three functions? – Xara Mar 15 '14 at 04:40