Where to learn about low-level, hard-core performance stuffs?

Question

This is actually a 2 part question:

For people who want to squeeze every clock cycle, people talk about pipelines, cache locality, etc.

I have seen these low level performance techniques mentioned here and there but I have not seen a good introduction to the subject, from start to finish. Any resource recommendations? (Google gave me definitions and papers, where I'd really appreciate some kind of worked examples/tutorials real-life hands-on kind of materials)
How does one actually measure this kind of things? Like, as in a profiler of some sort? I know we can always change the code, see the improvement and theorize in retrospect, I am just wondering if there are established tools for the job.

(I know algorithm optimization is where the orders of magnitudes are. I am interested in the metal here)

For x86 machines check out the info [here](http://www.agner.org/optimize/) and (for more historical reasons) Michael Abrashs [Black Book](http://www.gamedev.net/page/resources/_/reference/programming/140/283/graphics-programming-black-book-r1698). — user786653, Aug 31 '11 at 16:05
QEMU is quite a useful tool for instruction-level performance analysis — SK-logic, Aug 31 '11 at 16:11
Argh! All answers are good, how do I go about choosing one!? — kizzx2, Sep 01 '11 at 12:56

score 3 · Answer 1 · answered Aug 31 '11 at 16:09

The chorus of replies is, "Don't optimize prematurely." As you mention, you will get a lot more performance out of a better design than a better loop, and your maintainers will appreciate it, as well.

That said, to answer your question: Learn assembly. Lots and lots of assembly. Don't MUL by a power of two when you can shift. Learn the weird uses of xor to copy and clear registers. For specific references, http://www.mark.masmcode.com/ and http://www.agner.org/optimize/

Yes, you need to time your code. On *nix, it can be as easy as time { commands ; } but you'll probably want to use a full-features profiler. GNU gprof is open source http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html

If this really is your thing, go for it, have fun, and remember, lots and lots of bit-level math. And your maintainers will hate you ;)

`And your maintainers will hate you` -- Meh. That's what code comments are for. — Robert Harvey, Sep 01 '11 at 01:23

old_timer · Answer 2 · 2011-09-01T19:49:16.830

EDIT/REWRITE:

If it is books you need Michael Abrash did a good job in this area, Zen of Assembly language, a number of magazine articles, big black book of graphics programming, etc. Much of what he was tuning for is no longer a problem, the problems have changed. What you will get out of this is the ideas of the kinds of things that can cause bottle necks and the kinds of ways to solve. Most important is to time everything, and understand how your timing measurements work so that you are not fooling yourself by measuring incorrectly. Time the different solutions and try crazy, weird solutions, you may find an optimization that you were not aware of and didnt realize until you exposed it.

I have only just started reading but See MIPS Run (early/first edition) looks good so far (note that ARM took over MIPS as the leader in the processor market, so the MIPS and RISC hype is a bit dated). There are a number of text books old and new to be had about MIPS. Mips being designed for performance (At the cost of the software engineer in some ways).

The bottlenecks today fall into the categories of the processor itself and the I/O around it and what is connected to that I/O. The insides of the processor chips themselves (for higher end systems) run much faster than the I/O can handle, so you can only tune so far before you have to go off chip and wait forever. Getting off the train, from the train to your destination half a minute faster when the train ride was 3 hours is not necessarily a worthwhile optimization.

It is all about learning the hardware, you can probably stay within the ones and zeros world and not have to get into the actual electronics. But without really knowing the interfaces and internals you really cannot do much performance tuning. You might re-arrange or change a few instructions and get a little boost, but to make something several hundred times faster you need more than that. Learning a lot of different instruction sets (assembly languages) helps get into the processors. I would recommend simulating HDL, for example processors at opencores, to get a feel for how some folks do their designs and getting a solid handle on how to really squeeze clocks out of a task. Processor knowledge is big, memory interfaces are a huge deal and need to be learned, media (flash, hard disks, etc) and displays and graphics, networking, and all the types of interfaces between all of those things. And understanding at the clock level or as close to it as you can get, is what it takes.

`x86 is not an interesting platform for this` -- No, it's only the most commonly-used platform on the planet. — Robert Harvey, Sep 01 '11 at 01:25
thanks for the improvement in my answer BTW, should have spent more effort/time on it. — old_timer, Sep 01 '11 at 01:33
@dwelch Would be nice if you backed up your claim "x86 is one of the least commonly used processors on the planet". I am not to flame or anything but I'd like to be able to back it up myself when I tell my friends this kind of things. — kizzx2, Sep 01 '11 at 10:19

score 1 · Answer 3 · answered Aug 31 '11 at 19:31

Intel and AMD provide optimization manuals for x86 and x86-64.

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html/

http://developer.amd.com/documentation/guides/pages/default.aspx

Another excellent resource is agner.

http://www.agner.org/optimize/

Some of the key points (in no particular order):

Alignment; memory, loop/function labels/addresses
Cache; non-temporal hints, page and cache misses
Branches; branch prediction and avoiding branching with compare&move op-codes
Vectorization; using SSE and AVX instructions
Op-codes; avoiding slow running op-codes, taking advantage of op-code fusion
Throughput / pipeline; re-ordering or interleaving op-codes to perform separate tasks avoiding partial stales and saturating the processor's ALUs and FPUs
Loop unrolling; performing multiple iterations for a single "loop comparison, branch"
Synchronization; using atomic op-code (or LOCK prefix) to avoid high level synchronization constructs

score 1 · Answer 4 · edited Sep 01 '11 at 01:27

1

I'd suggest Optimizing subroutines in assembly language An optimization guide for x86 platforms.

It's quite heavy stuff though ;)

edited Sep 01 '11 at 01:27

Robert Harvey

178,213
47
333
501

answered Aug 31 '11 at 23:01

BlackBear

22,411
10
48
86

I fixed your bare link for you. – Robert Harvey Sep 01 '11 at 01:27

score 1 · Answer 5 · edited May 23 '17 at 12:14

1

Yes, measure, and yes, know all those techniques.

Experienced people will tell you "don't optimize prematurely", which I relate as simply "don't guess".

They will also say "use a profiler to find the bottleneck", but I have a problem with that. I hear lots of stories of people using profilers and either liking them a lot or being confused with their output. SO is full of them.

What I don't hear a lot of is success stories, with speedup factors achieved.

The method I use is very simple, and I've tried to give lots of examples, including this case.

edited May 23 '17 at 12:14

Community

1
1

answered Sep 01 '11 at 01:20

Mike Dunlavey

40,059
14
91
135

:) Hello Mike, I kind of expected you when I added the tags "performance" and "optimization" :P I have used your "statistical" profiling techniques to great success and I sometimes promote it. Anyway for this question I am really interested in _learning_ what's behind the scene (i.e. I don't really have an actual application I need to optimize on hand) – kizzx2 Sep 01 '11 at 10:17
@kizzx2: Sorry if I'm so predictable :) For me it's a real treat to be faced with the problem of tuning the daylights out of a piece of code, provided I have the right to modify it. What frustrates me is when I can tell exactly what needs to be done and roughly how much it will save, but the code has an "owner" who simply doesn't want to do it. – Mike Dunlavey Sep 01 '11 at 11:36

Where to learn about low-level, hard-core performance stuffs?

5 Answers5