
I do a fair bit of dabbling in performance-focused programming. Typically, most of the techniques I have been taught and am aware of pertain to conserving RAM.

That being said, I was recently addressing the question here: Memory efficient AI objects for physics game,

where I was told:

it is usually the CPU that runs out of speed before memory is exhausted

We did some testing and it seemed packing/unpacking does conserve RAM, but definitely slows down performance.

But as I have said, most of the typical performance 'rules' that I've seen deal with conserving RAM.

One of the major topics in program speed, for example, is dynamic memory allocation, which is also focused on RAM conservation.

What I want to know is: What makes code CPU efficient? Do lower-level languages like C have more flexibility for CPU efficiency? If so, why/how?

For the sake of simplicity, let's exclude discussion of assembler languages because they are beyond the scope of this question.

bigcodeszzer
  • Generally all you need is an algorithm with a low complexity. The compiler can then do some optimization and you're set. – Alexguitar Dec 19 '15 at 20:00
  • Please don't make the mistake of thinking that compiler optimisation will improve a poor algorithm. The compiler has no idea what you are trying to achieve. – Weather Vane Dec 19 '15 at 20:23

4 Answers


Profiler

First things first, when you go beyond glaring algorithmic inefficiencies, you want to find yourself a nice profiler. A profiler has several benefits:

  1. Shows you precise measurements (where the time is spent, cache misses, branch mispredictions, etc).
  2. Chasing down your top hotspots will tend to rapidly accelerate your learning process and intuition of what causes bottlenecks at a micro level (ex: memory hierarchy and branching).
  3. Will make you a better prioritizer. It'll also teach you what code doesn't need optimization, which means you can focus on other metrics there, like productivity (maintainability, safety, etc.).

For me, #2 was actually quite a big one. I didn't really start learning a lot of this stuff very rapidly until I had a profiler in hand. It's similar to how you can learn a lot about programming by working on an actual, sizable project and looking up the things that crop up in the middle of it. In the same way, learning about micro-efficiency and computer architecture tends to be easier with a profiler in hand, as you chase one hotspot after another and research why it exists.

Memory Hierarchy

With that aside, probably the number one thing beyond algorithmic complexity (which is about scalability rather than some absolute sense of performance) is memory efficiency.

Caveat: this is going to be somewhat oversimplified and doesn't go into compiler design topics like register allocation and stack spills or even a very detailed description of the memory hierarchy.

Machines and operating systems establish memory in a hierarchical form, ranging from the absolute fastest but smallest memory (registers) to the absolute slowest and biggest (disk).

When accessing memory, the system loads it from slower memory into faster memory in largish, aligned chunks. For example, an operating system might page memory from a secondary storage device into physical memory (DRAM) in 4 kilobyte chunks.

[4 kilobyte chunk][*4 kilobyte chunk][4 kilobyte chunk][...]
// '*' indicates the chunk that's loaded in.

When you request access to virtual memory anywhere inside an aligned, 4 kilobyte chunk, the system will page that chunk into DRAM. Yet we're still not done. Typically, before we can do anything, we must load from DRAM into the CPU cache, which is itself divided into a hierarchy. In that case, memory might be loaded in 64-byte aligned cache line chunks, like so:

[64-byte chunk][64-byte chunk][*64-byte chunk][...]

... so memory access ends up loading from DRAM into the CPU cache that way. When you request to access memory in DRAM around one of these 64-byte chunks, the whole 64-byte chunk is loaded into the CPU cache.

And then the CPU cache itself is divided up into a hierarchy (though typically all using the same cache line size), and memory is moved down to faster but smaller CPU caches (fastest being L1). And last but not least, before doing things like arithmetic, the memory from the L1 cache is loaded into a register, which might be, say, 64 bits in size for a general-purpose CPU register. In that case, we end up having our CPU cache memory laid out like this within a 64-byte cache line:

[64 bits][64 bits][64 bits][*64 bits][64 bits][...]

So finally after working our way down to the smallest and fastest memory, we do some arithmetical instructions on the registers and then typically move the results back up the hierarchy.

Now this is somewhat of a gross simplification, and I might end up becoming embarrassed about it later. Yet the thing to keep in mind is that the CPU fetches memory from slower, bigger regions to faster, smaller regions in aligned chunks. It grabs memory by the contiguous handful. The hope for doing this is that you end up accessing that memory chunk multiple times (spatial/temporal locality) before it ends up getting evicted later.
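
To make the spatial-locality point concrete, here's a small C sketch (the array sizes are arbitrary and purely illustrative): summing a 2D array row by row uses every byte of each cache line it pulls in, while summing it column by column strides across memory and wastes most of each line it loads.

#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static double grid[ROWS][COLS]; /* stored row-major: grid[r][0..COLS-1] is contiguous */

/* Touches memory sequentially: each 64-byte cache line that gets pulled in
   is fully used before moving on. */
double sum_row_major(void)
{
    double sum = 0.0;
    for (size_t r = 0; r < ROWS; ++r)
        for (size_t c = 0; c < COLS; ++c)
            sum += grid[r][c];
    return sum;
}

/* Strides by a whole row per access: each cache line loaded is used for a
   single element and likely evicted before its neighbors are needed. */
double sum_column_major(void)
{
    double sum = 0.0;
    for (size_t c = 0; c < COLS; ++c)
        for (size_t r = 0; r < ROWS; ++r)
            sum += grid[r][c];
    return sum;
}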

Memory Optimization

Keeping this in mind, getting the most performance out of your code typically needs to start off prioritizing memory layout and access (besides algorithms and data structures, of course). Without efficient memory access, the fastest arithmetical instructions are hardly going to help.

One of the things worth keeping in mind here is contiguous arrays. Data arranged contiguously, and accessed in a sequential pattern, is ideal for this kind of memory hierarchy. It's because the computer might grab a big old chunk of memory (page, cache line), then we sequentially plow through it and access the entire chunk while it's in a faster form of memory prior to eviction.

Use the Data Before It's Evicted

The worst-case scenario is when you end up loading a big old chunk of memory only to use a little piece of it, and then have the system evict it before the rest gets used. Such scenarios can show up with linked structures like linked lists and trees (lacking a memory allocator to give them a more contiguous representation), where we might end up loading a chunk of memory for the region surrounding a node only to access that one node inside of it before it gets evicted.
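
Here's a rough C sketch of that contrast (the node type is hypothetical): traversing a list whose nodes were handed out one by one by a general-purpose allocator chases pointers all over the heap, while the same values packed into a plain array are swept through one cache line at a time.

#include <stdlib.h>

/* Hypothetical node type: each node is allocated separately with malloc,
   so consecutive nodes may land in unrelated cache lines or pages. */
struct node {
    int value;
    struct node *next;
};

/* Traversal chases pointers; each hop can be a fresh cache miss, and the
   64-byte line pulled in for one node is mostly unused. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (; head != NULL; head = head->next)
        sum += head->value;
    return sum;
}

/* Same values packed into a contiguous array: one cache line holds many
   of them, and the hardware prefetcher can see the sequential pattern. */
long sum_array(const int *values, size_t count)
{
    long sum = 0;
    for (size_t i = 0; i < count; ++i)
        sum += values[i];
    return sum;
}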

Another case where this shows up is in managed languages where each user-defined type has to be allocated separately (ex: through a garbage collector) but aggregated into an array-based list structure. In that case, even though we're storing an array of these objects, each object is actually represented through a reference (like a pointer) which points somewhere else in memory.

That might be one of the most compelling reasons to use a language like C or C++. They allow user-defined types to be aggregated contiguously as well as allocated on the stack (which has a great deal of temporal locality).
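
A small C sketch of the difference (the enemy type and counts are made up): an array of structs lives in one contiguous block, whereas an array of pointers to individually allocated objects (roughly what reference types in a managed language give you) scatters the actual data across the heap. A fixed-size local array would similarly sit contiguously on the stack.

#include <stdlib.h>

/* Hypothetical game-object type used only for illustration. */
struct enemy {
    float x, y;
    int   health;
};

#define COUNT 10000

void contiguous_layout(void)
{
    /* One allocation; enemies[i] and enemies[i+1] share cache lines. */
    struct enemy *enemies = malloc(COUNT * sizeof *enemies);
    if (!enemies) return;
    for (int i = 0; i < COUNT; ++i)
        enemies[i].health = 100;
    free(enemies);
}

void reference_layout(void)
{
    /* COUNT separate allocations; each access may jump to an unrelated
       region of the heap, much like an array of object references. */
    struct enemy **enemies = malloc(COUNT * sizeof *enemies);
    if (!enemies) return;
    for (int i = 0; i < COUNT; ++i)
        enemies[i] = malloc(sizeof **enemies);
    for (int i = 0; i < COUNT; ++i)
        if (enemies[i]) enemies[i]->health = 100;
    for (int i = 0; i < COUNT; ++i)
        free(enemies[i]);
    free(enemies);
}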

TL;DR

If you want to learn more about these subjects, I'd suggest investigating locality of reference. This article is also a must: http://lwn.net/Articles/250967/

Last but not least, if I'm allowed a shameless plug for a bounty question that I took lots of time to answer: What is the most efficient way to represent small values in a struct?.

But anyway, first things first is to grab a profiler and start chasing down hotspots. It's the fastest way to learn, and the most productive way to optimize.

Update

The wise advice in Jens' fine answer also gave me the urge to include a disclaimer: algorithmic efficiency still tends to be the first and foremost thing to worry about. I've worked with those types who talk about cache efficiency and multithreading all day while dealing with the most sub-optimal algorithms, and that's ineffective prioritization. As a blatant example, micro-optimizing or parallelizing a bubble sort of a million elements is far from effective.

Where a lot of memory optimization techniques tend to help the most immediately is in those sequential cases where there is no choice but to touch every element (no way to get below linear complexity). Examples are a particle simulator which needs to process every particle, an image-processing algorithm which must affect every pixel, or matrix multiplication involving massive matrices. In those cases, there's no way to algorithmically skip a vast portion of the work and still get the same result, since we must process every element. For those times, memory optimization techniques can become even more effective than parallelization, as well as giving you more out of parallelization.
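
As a hypothetical sketch of such a workload in C (allocation of the buffers is omitted): a particle integrator where every field the per-frame update touches is stored contiguously, in the order it is walked, so there's nothing to skip algorithmically and the layout does the heavy lifting.

#include <stddef.h>

/* Hypothetical particle data laid out as a structure of arrays: every field
   the per-frame update touches is contiguous, so the sequential sweep uses
   each cache line fully. */
struct particles {
    float *x, *y;    /* positions  */
    float *vx, *vy;  /* velocities */
    size_t count;
};

void integrate(struct particles *p, float dt)
{
    /* No way to skip work here: every particle must be advanced, so the
       layout and access pattern dominate the cost. */
    for (size_t i = 0; i < p->count; ++i) {
        p->x[i] += p->vx[i] * dt;
        p->y[i] += p->vy[i] * dt;
    }
}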

Yet there is also a memory efficiency concern underlying the heart of data structures and algorithms. A quicksort of an array still tends to beat merge sort in practical scenarios purely because of memory efficiency. There are even cases where a linearithmic algorithm may beat a linear one provided that the former is vastly more memory efficient.

Memory Allocators

I mentioned the cache-unfriendliness of linked structures like trees and linked lists earlier, but that's assuming each node is allocated against a general-purpose allocator (and possibly not all at once). One of the things that can start to actually make even a structure like a singly-linked list a whole lot more applicable is a memory allocator which gives its nodes the spatial locality they would normally lack. So there are ways to dig under your data structures and utilize memory allocators to make them more efficient that way, without switching to a whole new data structure.
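
Here's a bare-bones sketch in C of what I mean (a simplified fixed-capacity pool; a real allocator would also handle freeing individual nodes): list nodes carved out of one contiguous slab end up adjacent in memory even though the structure is still linked.

#include <stddef.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Minimal pool allocator: all nodes come from one contiguous slab, and the
   whole pool is released at once. */
struct node_pool {
    struct node *slab;   /* one big contiguous allocation */
    size_t       used;
    size_t       capacity;
};

int pool_init(struct node_pool *p, size_t capacity)
{
    p->slab = malloc(capacity * sizeof *p->slab);
    p->used = 0;
    p->capacity = capacity;
    return p->slab != NULL;
}

struct node *pool_alloc(struct node_pool *p)
{
    /* Successive allocations are adjacent in memory, so a list built from
       this pool keeps its spatial locality. */
    return (p->used < p->capacity) ? &p->slab[p->used++] : NULL;
}

void pool_destroy(struct node_pool *p)
{
    free(p->slab);
    p->slab = NULL;
    p->used = p->capacity = 0;
}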

There are also data structures like unrolled lists which are often overlooked since they offer no algorithmic benefits over a linked list. Yet they offer substantially greater benefits from a memory efficiency perspective, and in those scenarios where we have two data structures which have similar algorithmic complexity but vastly different memory layouts, the winner tends to be the one with more efficient memory layout and access patterns. The unrolled list links arrays of elements together rather than individual elements, and, again, spatial locality strongly favors contiguous array-based representations.
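
A minimal sketch of an unrolled list node in C (the capacity of 16 is an arbitrary choice): each link stores a small contiguous array of elements, so a traversal gets many values per cache line and makes far fewer pointer hops.

#include <stddef.h>

#define NODE_CAPACITY 16

/* Each link carries a small contiguous array of elements instead of a
   single one. */
struct unrolled_node {
    int    values[NODE_CAPACITY];
    size_t count;                 /* how many slots are in use */
    struct unrolled_node *next;
};

long sum_unrolled(const struct unrolled_node *head)
{
    long sum = 0;
    for (; head != NULL; head = head->next)
        for (size_t i = 0; i < head->count; ++i)  /* contiguous inner loop */
            sum += head->values[i];
    return sum;
}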

Yet just about any micro-optimization will degrade the straightforwardness and maintainability of your code. So the key to optimization in general is prioritization, and that's where the profiler can at least somewhat help you stay in check (from a productivity perspective, a profiler has the enormous benefit of showing you what not to optimize that you might have otherwise been tempted to attempt).

Community
  • Well, your answer is extremely detailed and thorough, and the 'profiler' may be the most useful answer, but I'm not sure if you really addressed the question. Then again, I may have not explained it enough. I gave you the upvote, but a lot of what you were discussing refers to assembly-level manipulations, which I did mention were beyond the scope of the question. Incidentally, you seem to have answered the entire topic anyway. – bigcodeszzer Dec 19 '15 at 22:00
  • @bigcodeszzer Oh, these are computer architecture topics but you don't have to write assembly code to benefit from memory efficiency. For example, if you write code in Java, you might avoid an `ArrayList` of `Integer` in favor of an array of `ints`, since `Integer` has to be allocated individually (dispersing memory and losing locality of reference -- specifically spatial locality). The concepts apply regardless of the language since the hardware is the same. –  Dec 19 '15 at 22:02
  • Right, I'm just not sure how you can actually do some of the things you spoke about in 'Memory optimization' and 'Data eviction' in languages like C, C++ or Java? – bigcodeszzer Dec 19 '15 at 22:04
  • Like even in C, I didn't think you could have control over how the memory is loaded in and out of the cache? – bigcodeszzer Dec 19 '15 at 22:05
  • @bigcodeszzer You can't directly control those things often even if you write assembly code. What you *can* control, for example, is to use more array-based contiguous data structures. In C and C++, you can write memory allocators which pack your memory into more of a contiguous array-like form even for structures like binary trees. That plays well to the cache, but paging and caching is done behind your back typically -- it's something out of your direct control (but you can write your code in a way that really plays well to it). –  Dec 19 '15 at 22:06
  • Well I certainly didn't know that, though you probably can't control that as much in a language that doesn't support dynamic allocation, like Java. Unless you are saying that the goal is to just allocate a big array and then divvy it up by index? Is that what you mean by allocator? – bigcodeszzer Dec 19 '15 at 22:09
  • What I'm really curious about is access. If I understand correctly, the registers have to shift a reference of data? So I'm curious what is more/less efficient in terms of accessing data, as well as iterating through it. – bigcodeszzer Dec 19 '15 at 22:10
  • @bigcodeszzer Somewhat. Even in Java, whenever you involve references to objects, they're allocated through GC and the GC can put them wherever in memory. When you allocate an array in Java, even if it's garbage collected, the array has to have a contiguous representation. So one strategy in a language like Java if you want to optimize is to kind of obliterate some of your objects and turn them into big old arrays of plain old data. That would give them a contiguous memory layout and spatial locality (though at potentially great cost to code maintainability). –  Dec 19 '15 at 22:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/98447/discussion-between-ike-and-bigcodeszzer). –  Dec 19 '15 at 22:11
  • @Ike A detailed explanation worth reading – Dilip Kumar Dec 19 '15 at 23:24

What makes code CPU efficient?

Fewer instructions, fewer branches, and minimal usage of variables in the code lead to less usage of CPU resources. All of this can be achieved effectively by applying efficient algorithms for your logic and removing unnecessary code. Also try to reduce I/O from memory, which takes more time to access.
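
For example, a small C sketch (illustrative only): hoisting a loop-invariant condition out of a hot loop removes a branch per iteration and leaves a simpler body for the compiler to optimize.

/* Branch inside the hot loop: the condition is re-evaluated N times. */
void scale_branchy(float *out, const float *in, int n, int negate)
{
    for (int i = 0; i < n; ++i) {
        if (negate)
            out[i] = -in[i];
        else
            out[i] = in[i];
    }
}

/* Invariant hoisted out: one branch total, and each iteration is a simple
   multiply that the compiler can vectorize more easily. */
void scale_hoisted(float *out, const float *in, int n, int negate)
{
    const float sign = negate ? -1.0f : 1.0f;
    for (int i = 0; i < n; ++i)
        out[i] = sign * in[i];
}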

Do lower-level languages like C have more flexibility for CPU efficiency?

The CPU's job is just to execute instructions; the only control you have is in your software, by minimizing the number of instructions it must execute. Roughly speaking, the work the CPU does grows with that instruction count. C was designed to be compiled using a relatively straightforward compiler, to provide low-level access to memory, to provide language constructs that map efficiently to machine instructions, and to require minimal run-time support. C was therefore useful for many applications that had formerly been coded in assembly language, such as system programming.

Dilip Kumar

A general question deserves a general answer:

All optimization is an exercise in caching.

Especially on today's multi-level cache architectures.

Beware the foolish idea that all you need to do is cram the code in the level 1 instruction cache, and all your data in the level 1 data cache, in order to efficiently compute that O(N²) algorithm, because along comes a genius who lives and breathes the exercise by doing the heavy lifting with an O(1) lookup in a big table.
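
A tiny C illustration of that big-table idea (the gamma-correction use case is just an example): spend a few hundred bytes of cheap RAM once, and the hot loop becomes an O(1) table lookup.

#include <math.h>

/* Precomputed table: 8-bit gamma correction computed once at startup
   instead of calling pow() for every pixel. Trades 256 bytes of cheap RAM
   for removing an expensive computation from the hot path. */
static unsigned char gamma_table[256];

void init_gamma_table(double gamma)
{
    for (int i = 0; i < 256; ++i)
        gamma_table[i] = (unsigned char)(255.0 * pow(i / 255.0, gamma) + 0.5);
}

/* O(1) lookup per pixel in the hot loop. */
void apply_gamma(unsigned char *pixels, long n)
{
    for (long i = 0; i < n; ++i)
        pixels[i] = gamma_table[pixels[i]];
}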

In other words, RAM and disk space are cheap. Use them to your advantage.

Jens

As long as a language has a fairly decent compiler, the code it generates should be about the same as any other.

The problem with different languages is they can tempt you to do things that cost extra time. Like C++ tempts you to use new because it's so easy, and there are all kinds of container classes to make it easy to do fancy stuff. If you are working in C, it's a lot more trouble to do the fancy stuff so guess what - you don't do it (unless you really have to) and you don't pay the price in performance.

It's tempting to think all the nice features of the advanced languages are free, or at most negligible, but in fact they can multiply each other, as this example shows. You can still use the advanced languages, but if you know how to do performance tuning, you can get the benefit of their advanced features without paying for things you don't need.

Mike Dunlavey