
I would like to learn more about low level code optimization, and how to take advantage of the underlying machine architecture. I am looking for good pointers on where to read about this topic.

More details:

I am interested in optimization in the context of scientific computing (which involves a lot of number crunching, but not only that) in low-level languages such as C/C++. I am particularly interested in optimization methods that are not obvious unless one has a good understanding of how the machine works (which I don't---yet).

For example, it's clear that a better algorithm is faster without knowing anything about the machine it runs on. It's not at all obvious whether it matters if one loops through the columns or the rows of a matrix first. (It's better to loop through the matrix so that elements stored at adjacent memory locations are read successively.)
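To make the row/column example concrete, here is a minimal sketch (function names are mine, not from any of the references) of the two traversal orders over a square matrix stored row-major, as C/C++ arrays are:

```cpp
#include <vector>
#include <cstddef>

// Row-then-column order touches adjacent addresses on every step,
// so each cache line loaded from memory is used in full.
double sum_row_major(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)        // row
        for (std::size_t j = 0; j < n; ++j)    // column: adjacent addresses
            s += m[i * n + j];
    return s;
}

// Column-then-row order jumps n doubles between accesses; for large n
// every access can be a cache miss, although the result is identical.
double sum_col_major(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)    // stride of n doubles
            s += m[i * n + j];
    return s;
}
```

Both functions compute the same sum; only the memory access pattern differs, which is exactly the kind of machine-dependent effect the question is about.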

Basic advice on the topic or pointers to articles are most welcome.

Answers

Got answers with lots of great pointers, a lot more than I'll ever have time to read. Here's a list of all of them:

I'll need a bit of skim time to decide which one to use (not having time for all).

Szabolcs
  • Thanks for all the answers. Although I can only accept one answer, all were very helpful. Unfortunately I was unable to get the Intel book, which I suspect might be more useful for me than the accepted answer. – Szabolcs Aug 08 '11 at 12:42

8 Answers


Drepper's What Every Programmer Should Know About Memory [pdf] is a good reference to one aspect of low-level optimisation.
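The cache material in Drepper's paper motivates transformations like loop tiling ("cache blocking"). A sketch of a blocked matrix multiply, with an illustrative (untuned) block size I picked myself:

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Blocked matrix multiply: c += a * b for n x n row-major matrices.
// Working on B x B sub-blocks keeps the active data small enough to
// stay in cache instead of streaming whole rows of large matrices.
// B = 64 is illustrative, not a tuned value for any particular CPU.
void matmul_blocked(const std::vector<double>& a, const std::vector<double>& b,
                    std::vector<double>& c, std::size_t n) {
    const std::size_t B = 64;
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t kk = 0; kk < n; kk += B)
            for (std::size_t jj = 0; jj < n; jj += B)
                for (std::size_t i = ii; i < std::min(ii + B, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + B, n); ++k) {
                        double aik = a[i * n + k];  // reused across the j loop
                        for (std::size_t j = jj; j < std::min(jj + B, n); ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

The loop nest computes the same result as the textbook triple loop; only the traversal order changes, which is the whole point of the technique.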

caf
  • +1, From a quick skim: at 114 pages it's *very* detailed and technical, going as far as talking about the physical implementation of different memory types, but also contains a lot of information relevant to my question, in particular about using the CPU cache. There's a lot of info about optimization and correct benchmarking. – Szabolcs Jul 28 '11 at 00:46
  • The chapter on caches is truly a must-read for any programmer who doesn't hate their customers. – Crashworks Jul 29 '11 at 05:10

For Intel architectures this is priceless: The Software Optimization Cookbook, Second Edition

Remus Rusanu
  • From the description looks like it's exactly what I am looking for :-) – Szabolcs Jul 28 '11 at 00:08
  • do you know of a similar book but for arm architecture? – Vinicius Kamakura Jul 28 '11 at 00:14
  • @hexa: unfortunately I don't know of any. A quick search reveals things like http://www.amazon.com/ARM-System-Developers-Guide-Architecture/dp/1558608745 but I have never opened those. For the one in my link I can vouch 100%: it is excellent quality and packed with tips and tricks. – Remus Rusanu Jul 28 '11 at 00:23

It's been a few years since I read it, but Write Great Code, Volume 2: Thinking Low-Level, Writing High-Level by Randall Hyde was quite good. It gives good examples of how C/C++ code translates into assembly, e.g. what really happens when you have a big switch statement.
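On the switch-statement point: for a dense run of case values, compilers typically emit a jump table (one bounds check plus an indirect jump) rather than a chain of comparisons. A small example you can inspect yourself with `objdump -d` or a compiler explorer:

```cpp
// With dense, consecutive case values like these, gcc/clang at -O2
// usually compile the switch to a jump table: index into a table of
// addresses and jump, so the cost is independent of which case hits.
// Sparse case values instead fall back to a tree of comparisons.
int classify(int op) {
    switch (op) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        case 3: return 40;
        default: return -1;
    }
}
```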

Also, altdevblogaday.com is focused on game development, but the programming articles might give you some ideas.

celion

An interesting book about bit manipulation and smart ways of doing low-level things is Hacker's Delight.

This is definitely worth a read for everyone interested in low-level coding.
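A taste of the kind of trick the book collects, using Kernighan's classic bit-counting idea (my own small example, not taken from the book):

```cpp
#include <cstdint>

// x & (x - 1) clears the lowest set bit of x, so this loop runs
// once per 1-bit rather than once per bit position.
int popcount(std::uint32_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  // drop the lowest set bit
        ++n;
    }
    return n;
}
```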

murrekatt

Check out: http://www.agner.org/optimize/

Mike

C and C++ are usually the languages used for this because of their speed (ignoring Fortran, as you didn't mention it). What you can take advantage of (which the icc compiler does a lot) is the SSE instruction sets for floating-point number crunching. Another possibility is using the CUDA and Stream APIs (for Nvidia and ATI respectively) to do very fast floating-point operations on the graphics card while leaving the CPU free to do the rest of the work.
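As a sketch of what explicit SSE use looks like without dropping to assembly: the compiler intrinsics in `<immintrin.h>` map almost one-to-one to SSE instructions. This example (x86/x64 only; `sse_sum` is a name I made up) sums floats four at a time:

```cpp
#include <immintrin.h>  // x86/x64 SSE intrinsics
#include <cstddef>

// Sum an array of floats four lanes at a time with SSE.
// Assumes n is a multiple of 4 to keep the sketch short;
// real code would handle the remainder with a scalar tail loop.
float sse_sum(const float* a, std::size_t n) {
    __m128 acc = _mm_setzero_ps();              // four zeroed accumulators
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));  // load 4 floats, add
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                  // spill the 4 partial sums
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Compilers can often auto-vectorize simple loops like this themselves; the intrinsics are for when you need to guarantee it or express something the vectorizer misses.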

Jesus Ramos
  • Yes, for example one thing I'd like to learn about is how to take advantage of SSE: how to write code from which the compiler can easily generate SSE code, or how to *explicitly* use SSE from C++ without resorting to assembly language. – Szabolcs Jul 28 '11 at 00:06
  • I think there are wrapper libraries out there and if you can get your hands on the ICC compiler (not sure if it's free for certain platforms) it does a lot of this for you. Also where did that -1 come from? – Jesus Ramos Jul 28 '11 at 00:08
  • I was assuming so, oh well. I would also recommend looking into Stream/CUDA for pipelining heavy mathematical operations as graphics cards are able to do that incredibly fast. – Jesus Ramos Jul 28 '11 at 00:10
  • I know about CUDA (and would like to try it, but don't have access to a suitable GPU so never used it so far), however the gist of my question is not *what* technology to use, but *how* to use the CPU, in particular an x86/x64 CPU, efficiently. – Szabolcs Jul 28 '11 at 00:13
  • That's very hard to answer without a broad answer as we don't know what you're trying to do exactly. I would start with SSE and maybe even the intel compiler (again not sure if it's free) and look at the book posted by the other answer. – Jesus Ramos Jul 28 '11 at 00:15
  • check out intel's ispc compiler, it is open source so you can see the auto-vectorization at work (or just use it if it produces fast-enough code) –  Jun 27 '14 at 18:25

Another approach to this is hands-on comparison. You can get a library like Blitz++ (http://www.oonumerics.org/blitz/) which - I've been told - implements aggressive optimisations for numeric/scientific computing, then write some simple programs doing operations of interest to you (e.g. matrix multiplications). As you use Blitz++ to perform them, write your own class that does the same, and if Blitz++ proves faster, start investigating its implementation until you realise why. (If yours is significantly faster, you can tell the Blitz++ developers!)

You should end up learning about a lot of things, for example:

  • memory cache access patterns
  • expression templates (there are some bad links atop Google search results for expression templates; the key property you want to find discussed is that they can encode many successive steps in a chain of operations so that they are all applied during a single loop over a data set)
  • some CPU-specific instructions (though I haven't checked whether Blitz++ uses such non-portable techniques)...
Tony Delroy

I learned a lot from the book Inner Loops. It's ancient now in computer terms, but it's very well written, and Rick Booth is so enthusiastic about his subject that I would still say it's worth looking at to see the kind of mindset you need to make a CPU fly.

Bryan
  • Thanks for the suggestion! The book is available [here \[archive.org\]](http://archive.org/details/innerloopssource00boot) as well, but it is accessible only under certain conditions. – Szabolcs Nov 13 '12 at 19:13