
So recently (today) I learned about the concept of CPU branch prediction. Basically, if the outcomes of your if statements are predictable, your code will run faster. Below is a program (a console application in C#) I wrote that demonstrates this concept:

using System;
using System.Collections.Generic;
using System.Diagnostics;

/*
 * Below are the times for predicted vs. mispredicted branches, in seconds:
 *
 * Predictable:      0.91
 * Unpredictable:    1.61
 *
 * Summary: when the branch is predictable, the program runs about 1.8x as fast.
 */
namespace BranchPredictions
{
    class Program
    {
        static void Main(string[] args)
        {
            const int MAX = 100000000; // The number of branch decisions to make
            bool predictable = true; // When true, the list is filled in a predictable order
            var nums = new List<int>(MAX);
            var random = new Random();

            for (int i = 0; i < MAX; i++)
            {
                if (predictable)
                {
                    nums.Add(i);
                }
                else
                {
                    nums.Add(random.Next());
                }
            }

            int count = 0;
            var sw = Stopwatch.StartNew();
            foreach (var num in nums)
            {
                if (num % 2 == 0)  // Here is the branch
                {
                    count++;
                }
            }
            sw.Stop();

            Console.WriteLine("Total count: {0}", count);
            Console.WriteLine("Time taken: {0}", sw.Elapsed);

            if (Debugger.IsAttached)
            {
                Console.Write("Press any key to continue..");
                Console.ReadKey(true);
            }
        }
    }
}

It opened my eyes: by knowing some hardware concepts, I can make certain code run much faster without any real code changes at all!

But it makes me wonder: what else does hardware do that can also make my software run faster if I'm aware of it?

I use Windows and C#, but these concepts should apply to all computers and languages.

user1515024
  • Reminds me of SO's highest voted question: [Why is processing a sorted array faster than an unsorted array?](http://stackoverflow.com/q/11227809/86967) – Brent Bradburn Mar 08 '14 at 06:08
  • @dwelch why is timing the program an invalid measure of performance? Since I only timed the looped if statement where I expect branch predictions? – user1515024 Mar 08 '14 at 06:10
  • @nobar Yes, that is exactly where I learned the concept. – user1515024 Mar 08 '14 at 06:11
  • @dwelch Fair enough, what other kind of optimization could it possibly be? Considering the performance is directly affected by the order of odd / even numbers in the list? – user1515024 Mar 08 '14 at 06:13
  • @dwelch I'm confused, how is the code different once the stopwatch starts? It tests N integers and counts the even ones. Far as I see it the code is the same once I start measuring the branch prediction. – user1515024 Mar 08 '14 at 06:19
  • @dwelch You seem to assume that OP wants to compare the performance with and without branch prediction, which I find highly unlikely. Looks like he just wanted to demonstrate that, if there is branch prediction, a predictable branch is better. This benchmark shows that. Also, the time taken by `nums.Add(whatever)` part is irrelevant since it's not being measured. – harold Mar 08 '14 at 09:06
  • @dwelch maybe you're reading the code wrong? The call to rand is irrelevant, except that it changes the data that will be branched on. Any difference in time can only come from the branch being correctly predicted or not. – harold Mar 08 '14 at 15:55
  • I'm out, will leave it to you to figure out what is really going on here and why there was a difference in execution time. – old_timer Mar 08 '14 at 16:19

2 Answers


First, as a side note to the discussion of branch prediction: one can hint the compiler about the predictability problem so that it produces constant-speed code using e.g. cmov. One can also often convert if/else constructions into constant-time expressions, such as count += 1 - (num % 2);, to validate a branch prediction hypothesis.
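
A minimal sketch of that branchless version, reusing the counting loop from the question (the arithmetic assumes non-negative inputs, since % can return a negative remainder in C#):

using System;
using System.Collections.Generic;

class BranchlessCount
{
    static void Main()
    {
        var nums = new List<int> { 3, 8, 1, 4, 7, 6 };

        int count = 0;
        foreach (var num in nums)
        {
            // num % 2 is 0 for even and 1 for odd (non-negative input),
            // so this adds 1 exactly when num is even - no branch to predict.
            count += 1 - (num % 2);
        }
        Console.WriteLine("Even count: {0}", count);
    }
}

If you run this against the question's 100-million-element list, the timing should no longer depend on whether the data is ordered or random.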

Other major hardware concepts to exploit / consider are memory bandwidth and the cache. Splitting a large calculation over, e.g., a 10000x10000 array into 1250x1250 tiles of 8x8 elements exploits cache locality.
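
A hedged sketch of the blocking idea, using a matrix transpose (the sizes and the 8x8 tile are illustrative, not tuned for any particular cache):

using System;

class BlockedTranspose
{
    const int N = 1024; // matrix dimension (illustrative)
    const int B = 8;    // tile size; tune to the cache line / L1 size

    static void Main()
    {
        var src = new double[N * N];
        var dst = new double[N * N];

        // Walk the matrix in B x B tiles so the rows of src and the
        // columns of dst each stay within a few cache lines per tile.
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        dst[j * N + i] = src[i * N + j];

        Console.WriteLine("Transposed a {0}x{0} matrix in {1}x{1} tiles", N, B);
    }
}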

Memory bandwidth can be taken into account by "micro-managing" the size of array elements: don't use int when e.g. char or byte is sufficient.
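
A rough sketch of the effect (not a rigorous benchmark; the element count is arbitrary): scanning the same number of elements as byte instead of int moves a quarter of the memory traffic.

using System;
using System.Diagnostics;

class ElementSize
{
    const int N = 100000000;

    static void Main()
    {
        var wide = new int[N];    // ~400 MB to stream through
        var narrow = new byte[N]; // ~100 MB for the same element count

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++) sum += wide[i];
        Console.WriteLine("int[]:  {0} ({1})", sw.Elapsed, sum);

        sw.Restart();
        for (int i = 0; i < N; i++) sum += narrow[i];
        Console.WriteLine("byte[]: {0} ({1})", sw.Elapsed, sum);
    }
}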

The N-way associativity of a cache can cause some array lengths to be slower than others, since certain memory addresses compete for the same cache lines; the cure is to over-allocate (pad) the rows.
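
A sketch of the over-allocation cure (the pad of one cache line's worth of ints is an assumption, not a tuned value):

using System;

class PaddedRows
{
    static void Main()
    {
        const int Rows = 1024, Cols = 1024;
        const int Pad = 16; // 16 ints = one 64-byte cache line

        // A pitch of exactly 1024 ints maps every element of a column onto
        // the same few cache sets; padding the pitch spreads them out.
        int pitch = Cols + Pad;
        var data = new int[Rows * pitch];

        long sum = 0;
        for (int col = 0; col < Cols; col++)
            for (int row = 0; row < Rows; row++)
                sum += data[row * pitch + col]; // column walk over padded rows

        Console.WriteLine(sum);
    }
}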

Loop unrolling often helps more by creating independent chains of dependent operations than by reducing branch prediction costs. Without going all the way to parallel programming, interleaving two independent processing tasks within the same loop can in some cases double or triple the speed, by giving the processor something useful to calculate while one instruction waits for the result of a previous one. The distinction between instruction throughput and latency explains this phenomenon.
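
A rough sketch of the idea: splitting one dependent sum into two independent accumulators lets the CPU overlap the additions (any gain depends on the hardware and on what the JIT emits).

using System;

class TwoAccumulators
{
    static void Main()
    {
        var data = new double[10000000];
        for (int i = 0; i < data.Length; i++) data[i] = i;

        // One chain: each add must wait for the previous result (latency bound).
        double slow = 0;
        for (int i = 0; i < data.Length; i++) slow += data[i];

        // Two chains: sum0 and sum1 are independent, so their additions can
        // overlap on the CPU's multiple ALUs (throughput bound).
        double sum0 = 0, sum1 = 0;
        for (int i = 0; i < data.Length; i += 2)
        {
            sum0 += data[i];
            sum1 += data[i + 1];
        }
        Console.WriteLine("{0} {1}", slow, sum0 + sum1);
    }
}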

Aki Suihkonen

Look into cache and its (sometimes crazy-seeming) effects. Here's a nice paper to get you started: What Every Programmer Should Know About Memory

A lot of the time, the cache is "just there", quietly making your program run much faster than if it had to hit DRAM all the time. But how well it can do its job depends on how you access memory. Some common programming constructs help it (like iterating over an array in a simple manner), while other common constructs are actually terrible for it (like walking through a linked list).

The world of caches is often contrary to conventional wisdom. For example, the good old time/memory trade-off, which often applies, is also often completely reversed: less memory often means fewer cache misses, and that may matter more than doing extra math (which has a trivial cost in comparison; you can easily execute a few hundred simple instructions in the time a single last-level cache miss takes). Also, it's almost always the case that an array-backed list is faster than a linked list, even if you do many insertions and removals, which are what a linked list is supposed to be good at. Linked lists don't play well with the cache: they waste a lot of it on pointers (1 or 2 per item), and they often access memory in an unpredictable pattern.
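
A quick way to see this for yourself (a sketch; absolute timings will vary, but the array-backed list usually wins by a wide margin):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class ListVsLinkedList
{
    static void Main()
    {
        const int N = 10000000;
        var array = new List<int>(N);
        var linked = new LinkedList<int>();
        for (int i = 0; i < N; i++) { array.Add(i); linked.AddLast(i); }

        long sum = 0;
        var sw = Stopwatch.StartNew();
        foreach (var x in array) sum += x;  // contiguous memory, prefetch-friendly
        Console.WriteLine("List<int>:       {0}", sw.Elapsed);

        sw.Restart();
        foreach (var x in linked) sum += x; // pointer chasing, cache-hostile
        Console.WriteLine("LinkedList<int>: {0}", sw.Elapsed);
        Console.WriteLine(sum);
    }
}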

Also counter-intuitive at first is the simple effect that if you access some memory, accessing it (or anything near it, in the same cache line) again soon afterwards is essentially free. Code is often "optimized" by removing redundant reads, but those aren't really the problem; the real problem is whenever a read misses the cache. Cache hits are not completely free either; do enough of them and you will see their impact. But don't focus on them, focus on misses.

But the world of caches is even weirder. Associativity effects can make accessing an array with certain strides suddenly much slower than with slightly different strides. This is often seen in matrices when you iterate over a column (of a row-major matrix), which translates to accessing a 1D array with a stride equal to the width of the matrix. Certain widths (usually multiples of some power of two) suddenly make that process unexpectedly slow.
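
To observe it, here is a sketch comparing two nearly identical widths (the exact widths and the size of the slowdown depend on the machine's cache parameters):

using System;
using System.Diagnostics;

class StrideEffect
{
    static void Main()
    {
        TimeColumnWalk(1024); // power-of-two width: columns collide in cache sets
        TimeColumnWalk(1025); // nearly the same width: often much faster
    }

    static void TimeColumnWalk(int width)
    {
        const int rows = 4096;
        var m = new int[rows * width];
        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int col = 0; col < width; col++)
            for (int row = 0; row < rows; row++)
                sum += m[row * width + col]; // stride of 'width' ints
        sw.Stop();
        Console.WriteLine("width {0}: {1} ({2})", width, sw.Elapsed, sum);
    }
}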

As if that's not enough, observe that since a cache is limited in size, using it for one thing means something else goes out. That causes non-local effects. Particularly, if you call a subroutine, some things that used to be in cache now may not be. This can cause situations where if you make the subroutine faster at the expense of more cache usage, it may make the caller slower - perhaps by as much as (or more than) you've saved in the subroutine.

So here are a couple of simple things you can do:

  • split "hot" and "cold" data. That means that when you access the hot data, you won't waste precious cache space on data you weren't going to use.
  • consider the Struct of Arrays layout, which basically means "split everything" (see the sketch after this list). Splitting everything automatically splits hot and cold data, in such a way that if the sets of hot and cold fields change, the data is still split exactly right. Vectorization (which is also something you should look into) benefits from this layout too.
  • if you have a very big array of some object, try to make the object smaller. Use float instead of double, use shorter integer types (even bitfields), and use array indices instead of pointers (especially in 64bit). Even if that means using constructs that are generally thought of as slower, the additional processing may be well worth the saved cache misses.
  • try using a data structure that is asymptotically slower but more memory efficient (i.e. a vector instead of a set, that sort of thing)
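
A minimal sketch of the Struct of Arrays idea mentioned above (the field names and the hot/cold split are made up for illustration):

using System;

// Array of Structs: hot and cold fields share every cache line.
struct ParticleAoS
{
    public float X, Y;     // hot: touched every frame
    public long SpawnTime; // cold: rarely read
    public long Flags;     // cold: rarely read
}

class SoADemo
{
    const int N = 1000000;

    // Struct of Arrays: each field gets its own array, so a pass over X and Y
    // streams only hot data and never drags the cold fields into the cache.
    static float[] xs = new float[N];
    static float[] ys = new float[N];
    static long[] spawnTimes = new long[N];
    static long[] flags = new long[N];

    static void Main()
    {
        for (int i = 0; i < N; i++) { xs[i] += 1f; ys[i] += 1f; }
        Console.WriteLine("{0} {1} {2}", xs[0], ys[0], spawnTimes.Length + flags.Length);
    }
}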

None of these tips are really magic. Always benchmark everything.

harold