How to profile a C++ function at assembly level?

Question

I have a function that is the bottleneck of my program. It requires no access to memory, only calculation. It is the inner loop and is called many times, so any small gain in this function is a big win for my program.

I come from a background in optimizing SPU code on the PS3, where you take an SPU program and run it through a pipeline analyzer that puts each assembly statement in its own column, and you minimize the number of cycles the function takes. Then you overlay loops so you can minimize pipeline dependencies even more. With that program and a list of the cycles each assembly instruction takes, I could optimize much better than the compiler ever could.

On a different platform it had events I could register (cache misses, cycles, etc.) and I could run the function and track CPU events. That was pretty nice as well.

Now I'm doing a hobby project on Windows using Visual Studio C++ 2010 w/ a Core i7 Intel processor. I don't have the money to justify paying the large cost of VTune.

My question:

How do I profile a function at the assembly level for an Intel processor on Windows?

I want to compile, view disassembly, get performance metrics, adjust my code and repeat.

I don't know if it works on Intel processors, but I used AMD CodeAnalyst with profit and it's free (I think it does work but with some features disabled). — Matteo Italia, Oct 02 '11 at 18:52
See http://stackoverflow.com/questions/67554/whats-the-best-free-c-profiler-for-windows-if-there-are — Alan Stokes, Oct 02 '11 at 18:53
Getting the disassembly is easy. Every modern compiler has an option to spit out the assembly with the source code inlined with it. (though the exact option depends on which compiler) I find this very useful when I'm tracking down bottlenecks. — Mysticial, Oct 02 '11 at 18:55
I'd love to see some actual code and suggest things. However, voting close as _just a duplicate_ for now... — sehe, Oct 02 '11 at 18:55
Me too, why don't you post the function in question and let us have a go? — AndersK, Oct 02 '11 at 19:02
@Mysticial of course getting the disassembly is easy, but getting metrics relevant for the processor against the assembly is the tricky part. Each platform has its own unique set of ways to do things. — coderdave, Oct 02 '11 at 19:03
@AndersK. The question isn't about optimizing my functions, it's about workflows for optimizing a function. I'm quite good at doing optimizations, I just need a benchmark to know if I'm going in the right direction. I was hoping to hear people's experience doing exactly this sort of thing. — coderdave, Oct 02 '11 at 19:05
I'm not aware of too many instruction-by-instruction profilers. I usually do things on a loop-by-loop basis and compare performance results with what I "should" be getting based on published instruction latencies, port usage and such... If it doesn't perform to my expectations, then I tweak it until it does or until I give up. — Mysticial, Oct 02 '11 at 19:07
@AlanStokes Thanks Alan - I did see that link, but I've already identified which function is my bottleneck and now I'd like suggestions on how to benchmark at the assembly level, like the example I gave in my post about using an instruction sheet with cycle counts on SPU and a pipeline analyzer. — coderdave, Oct 02 '11 at 19:07
@sehe: It's clearly not a duplicate of Alan's link. What's got into you? — TonyK, Oct 02 '11 at 19:16

Necrolis · Accepted Answer · 2011-10-03T15:51:50.377

11

There are some great free tools available, mainly AMD's CodeAnalyst (in my experience on my i7 vs. my Phenom II, it's a bit handicapped on the Intel processor because it doesn't have access to the hardware-specific counters, though that might have been bad configuration on my part).

However, a lesser-known tool is the Intel Architecture Code Analyzer (IACA), which is free like CodeAnalyst and similar to the SPU tool you described: it details latency, throughput and port pressure (basically the request dispatches to the ALUs, MMU and the like) line by line for your program's assembly. Stan Melax gave a nice talk on it and x86 optimization at this year's GDC, under the title "hotspots, flops and uops: to-the-metal cpu optimization".
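As a rough sketch of how a region gets handed to IACA: the tool analyzes the code between marker macros that come from its `iacaMarks.h` header, and for a loop the start marker goes at the top of the loop body and the end marker right after the loop. In the snippet below, `USE_IACA` is a hypothetical build flag and `horner_sum` is a made-up compute-only kernel standing in for the real inner loop; neither comes from the question. The header also defines compiler-specific marker variants, so check it for the one that matches your toolchain.

```cpp
// Sketch only. USE_IACA is a hypothetical build flag; iacaMarks.h ships with IACA.
// Without the flag the markers expand to nothing, so the code builds anywhere.
#ifdef USE_IACA
#include "iacaMarks.h"
#else
#define IACA_START
#define IACA_END
#endif

// Hypothetical compute-only kernel (no memory access inside the loop).
float horner_sum(int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        IACA_START  // start marker: first thing inside the loop body
        float x = static_cast<float>(i);
        acc += (2.0f * x + 3.0f) * x + 1.0f;  // 2x^2 + 3x + 1 in Horner form
    }
    IACA_END  // end marker: immediately after the loop
    return acc;
}
```

The markers only tag the region in the compiled binary; the analysis itself is produced by running the IACA tool on that binary afterwards.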

Intel also has a few more tools in the same vein as IACA, available under the performance tuning section of their experimental/what-if code site, such as PTU, which is (or was) an experimental evolution of VTune. From what I can see, it's free.

It's also a good idea to read the Intel optimization manual before diving into this.

EDIT: as Ben pointed out, the timings might not be correct for older processors, but that can be easily made up for using Agner Fog's Optimization manuals, which also contain many other gems.
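Static analysis still needs to be checked against a measurement. For the compile, measure, adjust, repeat loop, a minimal timing harness might look like the sketch below. It assumes a C++11 standard library with `<chrono>` (which I believe is newer than what VS2010 ships; on that toolchain a Windows timer such as `QueryPerformanceCounter` would be the substitute). The `kernel` function is a hypothetical stand-in for the real bottleneck, and `best_ns_per_call` is a name made up for this example.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Hypothetical compute-only kernel standing in for the real bottleneck.
static inline std::uint32_t kernel(std::uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

// Times several runs and returns the best (minimum) nanoseconds per call.
// Taking the minimum filters out runs disturbed by interrupts or scheduling.
double best_ns_per_call(int runs, int calls_per_run, std::uint32_t* sink) {
    double best = 1e300;
    std::uint32_t x = 2463534242u;  // arbitrary nonzero seed
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        // Each call depends on the previous result, so this measures latency.
        for (int i = 0; i < calls_per_run; ++i) x = kernel(x);
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        best = std::min(best, ns / calls_per_run);
    }
    *sink = x;  // publish the result so the compiler cannot drop the loop
    return best;
}
```

Multiplying the nanoseconds per call by the core clock in GHz gives an approximate cycle count to compare against what the instruction tables predict, with the caveat that frequency scaling makes this conversion inexact.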

edited Oct 03 '11 at 15:51

answered Oct 02 '11 at 19:11

Necrolis

Thank you! I'm going to check out the presentation and tool and see how much it helps. – coderdave Oct 02 '11 at 19:14
Except for the fact that IACA generates profile data for a processor that doesn't exist. – Ben Voigt Oct 02 '11 at 22:11
@Ben: my i7 supports AVX (and so does any other Sandy Bridge CPU), so I'm not sure what non-existent processor you are referring to. IACA is also applicable to more than just AVX processors though. – Necrolis Oct 03 '11 at 15:02
@Necrolis: The ability to generate timing information for older architectures was highly non-obvious from Intel's website. And are you sure that the timing generated by a simulator released long before the processors actually matches the released product? – Ben Voigt Oct 03 '11 at 15:08
@Ben: hmmm, that part is a little obscure, but they would be correct for an i3/i5/i7 processor, which are pretty common these days. – Necrolis Oct 03 '11 at 15:49

score 1 · Answer 2 · answered Oct 03 '11 at 08:43

You might want to try some of the utilities included in valgrind like callgrind or cachegrind.

Callgrind can do profiling and dump assembly.

And kcachegrind is a nice GUI, and will show the dumps including assembly and number of hits per instruction etc.

rounin

Sorry, didn't notice you are on windows. I don't think valgrind runs on windows. – rounin Oct 03 '11 at 08:45
It's a pity that great, free tools like this don't readily appear on Windows :( – Necrolis Oct 03 '11 at 15:57

score 0 · Answer 3 · answered Oct 02 '11 at 21:41

From your description it sounds like your problem may be embarrassingly parallel. Have you considered using PPL's parallel_for?

Motti

My code is set up to use multiple cores and I would never use parallel_for from PPL (not a fan of it). Even when run on multiple cores, and soon on GPU cores, my program will still benefit significantly from optimizing this function. Most importantly, this question is for me to learn how to properly optimize a function on Windows. I do appreciate the suggestion though, because it is a good one, thank you. – coderdave Oct 02 '11 at 22:07
