
I'm interested in how efficient low-level algorithms can be in .NET. I would like us to be able to choose to write more of our code in C# rather than C++ in the future, but one stumbling block is the bounds checking that .NET performs on looping and random access to arrays.

A motivating example is a function that calculates the sum of products of corresponding elements in two arrays (this is the dot product of two vectors).

static void SumProduct(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    for (int i = 0; i < length; i++) // Check X.Length instead? See below
        sum += X[i] * Y[i];
}

From what I can tell (I don't know enough IL or x86 to check), the compiler won't optimize out the bounds checking of X and Y. Am I wrong, and/or is there a way to write my code so that the compiler can help me out?
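For concreteness, here is a sketch of the "helpful" form I have in mind (the method name is mine, and I haven't verified that it helps): the loop condition tests the array length directly, which is the pattern the MSDN article discussed below says the jitter recognizes.

static double SumProductDirectLength(double[] X, double[] Y)
{
    if (X.Length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    double sum = 0;
    // Testing i < X.Length directly should let the jitter elide the
    // check on X[i]; whether the check on Y[i] also goes away is part
    // of what I'm asking.
    for (int i = 0; i < X.Length; i++)
        sum += X[i] * Y[i];
    return sum;
}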

Further details

There are many efficiency arguments for and against using particular languages, not least that it is better to concentrate on "big O" algorithmic cost rather than the constant of proportionality, and higher-level languages help you to do this. On the subject of bounds checking in .NET, the best article I found is Array Bounds Check Elimination in the CLR on MSDN (also referenced in a Stack Overflow answer on the importance of enabling optimization).

This dates from 2009, so I wonder whether things have changed significantly since then. Also, the article reveals some real subtleties that would have caught me out, so for this reason alone I would welcome some expert advice.

For example, it appears that in my code above I would have been better off writing i < X.Length rather than i < length. I had also naively assumed that, for an algorithm with a single array, writing a foreach loop would better declare your intent to the compiler and give it the best chance of optimizing out the bounds checking.

According to the MSDN article, SumForBAD below, which I thought was sure to be optimized, would not be; SumFor would be straightforwardly optimized; and SumForEach would also be optimized, but not trivially (and might not be optimized at all if the array were passed into a function as IEnumerable<double>).

static double SumForBAD(double[] X)
{
    double sum = 0;
    int length = X.Length; // better to use i < X.Length in the loop condition
    for (int i = 0; i < length; i++)
        sum += X[i];
    return sum;
}

static double SumFor(double[] X)
{
    double sum = 0;
    for (int i = 0; i < X.Length; i++)
        sum += X[i];
    return sum;
}

static double SumForEach(double[] X)
{
    double sum = 0;
    foreach (double element in X)
        sum += element;
    return sum;
}
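For comparison, here is a sketch (not benchmarked, and the method name is mine) of the IEnumerable variant mentioned above. Because the parameter is an interface, the compiler lowers the foreach to IEnumerator<double> calls, so there is no array indexing and hence no bounds check for the jitter to eliminate, but also no fast array path (requires using System.Collections.Generic;):

static double SumEnumerable(IEnumerable<double> X)
{
    double sum = 0;
    // Lowered to GetEnumerator()/MoveNext()/Current calls on the
    // interface, not to indexed array access.
    foreach (double element in X)
        sum += element;
    return sum;
}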

I did some investigation based on doug65536's answer. In C++, I compared the times of a SumProduct that does one bounds-check

for (int i = 0; i < n; ++i) sum += v1[i]*v2[i];

against another version that does two bounds-checks

for (int i = 0; i < n1 && i < n2; ++i) sum += v1[i]*v2[i];

I found that the second version was slower, but only by about 3.5% (Visual Studio 2010, optimized build, default options). However, it occurred to me that in C# there might be three bounds checks: one explicit (i < length in the function static void SumProduct(double[] X, double[] Y) at the start of this question) and two implicit (X[i] and Y[i]). So I tested a third C++ function, with three bounds checks

for (int i = 0; i < n1 && i < n2 && i < n3; ++i) sum += v1[i]*v2[i];

This came in 35% slower than the first, which is worth caring about. I did some more investigation in a follow-up question, Why does adding extra check in loop make big difference on some machines, and small difference on others?. Interestingly, it seems that the cost of bounds checking varies significantly on different machines.
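For reference, here is the C# shape of the "test both lengths in the loop condition" pattern that doug65536 describes in the comments on his answer. This is a sketch I haven't timed (the method name is mine), and I haven't verified that the jitter actually elides both implicit checks for it:

static double SumProductBothChecks(double[] X, double[] Y)
{
    double sum = 0;
    // The condition proves i < X.Length and i < Y.Length on every
    // iteration, which may let the jitter elide both implicit checks.
    for (int i = 0; i < X.Length && i < Y.Length; i++)
        sum += X[i] * Y[i];
    return sum;
}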

TooTone

4 Answers


The bounds check won't matter because:

  • The bounds check consists of a cmp/jae instruction pair, which is fused into a single micro-op on modern CPU architectures (the term is "macro-op fusion"). Compare and branch is very highly optimized.

  • The bounds check is a forward branch, which will be statically predicted not-taken, also reducing the cost. The branch will never be taken (and if it ever were, an exception would be thrown anyway, so the mispredict cost would be utterly irrelevant).

  • As soon as there is any memory delay, speculative execution will queue up many iterations of the loop, so the cost of decoding the extra instruction pair almost disappears.

Memory access will likely be your bottleneck, so the effect of micro-optimizations like removing bounds checks will disappear.
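As a rough illustration of the memory bottleneck (a sketch, not a rigorous benchmark; the class name and the array sizes are my assumptions about a typical cache hierarchy), compare the same safe loop over a cache-resident array and over one far larger than any cache. The per-element cost of the large case is dominated by memory stalls, which is where the cost of the bounds check hides:

using System;
using System.Diagnostics;

class MemoryBoundDemo
{
    static double Sum(double[] a)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++) // pattern the jitter recognizes
            sum += a[i];
        return sum;
    }

    static void Main()
    {
        double[] small = new double[4 * 1024];         // 32 KB, cache-resident
        double[] large = new double[32 * 1024 * 1024]; // 256 MB, memory-bound

        double sink = Sum(small) + Sum(large); // warm up the JIT

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 100000; i++)
            sink += Sum(small);
        Console.WriteLine("cache-resident: {0:F3} ns/element",
            sw.Elapsed.TotalMilliseconds * 1e6 / (100000.0 * small.Length));

        sw.Restart();
        for (int i = 0; i < 12; i++)
            sink += Sum(large);
        Console.WriteLine("memory-bound:   {0:F3} ns/element",
            sw.Elapsed.TotalMilliseconds * 1e6 / (12.0 * large.Length));

        Console.WriteLine(sink); // consume the sums so the loops are not optimized away
    }
}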

doug65536
  • thanks, I hadn't thought about it from that point of view. I'm going to try to make some time to look a bit further into what you wrote. – TooTone May 23 '13 at 20:02
  • I have just tried some performance measurements in C++. A dot product function that has two array bounds checks, as in `for(int i=0; i < n1 && i < n2; ++i)`, was only about 3.5% slower than one with a single check; see my edit to the question for details. – TooTone Jun 12 '13 at 14:00
  • @TooTone there are cases where the compiler will omit unnecessary bounds checks. My understanding is, if the loop condition already bounds-checks (by testing i against v1.Length and v2.Length), then it can elide the bounds check on access. – doug65536 Jun 12 '13 at 15:46
  • @TooTone ah, just noticed your edit to the question. Didn't realize you were already aware of possible bounds check elimination. – doug65536 Jun 12 '13 at 15:53
  • actually I hadn't thought of that, but it makes sense. I made the C++ code I wrote into [another question](http://stackoverflow.com/questions/17070607/why-does-adding-extra-check-in-loop-make-such-a-difference-to-performance), where there is already an interesting comment about compiler optimizations. – TooTone Jun 12 '13 at 16:42
  • I feel this answer could be modified slightly to make it less misleading. It *may* be the case that the bounds check overhead will not matter with this particular example, on a particular CPU, since only a sum is happening in the loop and the data type is rather wide. However it is not uncommon for array bounds overhead to absolutely have a significant impact in loops. One should measure to be sure. – jackmott Nov 03 '16 at 16:58
  • @jackmott If you have an extremely small loop body, the problem is the extremely small loop body. Extremely small loop bodies are so easy to execute that you become limited by memory bandwidth. Measurements will often blame the compare and branch when stalled on memory load/store queue full. Cycle measurement tools will often give misleading results in extreme cases. – doug65536 Nov 03 '16 at 18:03
  • @jackmott All the bounds checks in the world won't amount to a hill of beans compared to the abuse of the garbage collector that is typical of C# programs. – doug65536 Nov 03 '16 at 21:08
  • @doug65536: What do you mean? Since .Net 3.5 (IIRC) the garbage collector runs on a separate thread, causing its effect on performance to be minimal _(unless your software needs every core)_ – BlueRaja - Danny Pflughoeft Jun 10 '17 at 19:21
  • The saying "no code is faster than no code" applies here, and so does "an academically unnecessary bounds check is unnecessary." Academically unnecessary? Take an array, fixed in size by definition: performing a bounds check is only necessary when your compiler is incapable of determining that the check is unnecessary. Likewise, saying a bounds check is unnecessary because the Intel architecture behaves a certain way only leads otherwise good developers down the path of writing code that is not optimized on all platforms today, or on platforms of the future. – Shaun Wilson Oct 07 '17 at 01:58

64-bit

The 64-bit jitter does a good job of eliminating bounds checks (at least in straightforward scenarios). I added return sum; at the end of your method and then compiled the program using Visual Studio 2010 in Release mode. In the disassembly below (which I annotated with a C# translation), notice that:

  • There are no bounds checks for X, even though your code compares i against length instead of X.Length. This is an improvement over the behavior described in the article.
  • Before the main loop, there is a single check to make sure that Y.Length >= X.Length.
  • The main loop (offsets 00000032 through 00000052) does not contain any bounds checks.

Disassembly

; Register assignments:
;    rcx  := i
;    rdx  := X
;    r8   := Y
;    r9   := X.Length ("length" in your code, "XLength" below)
;    r10  := Y.Length ("YLength" below)
;    r11  := X.Length - 1 ("XLengthMinus1" below)
;    xmm1 := sum

; (Prologue)
00000000  push        rbx
00000001  push        rdi
00000002  sub         rsp,28h

; (Store arguments X and Y in rdx and r8)
00000006  mov         r8,rdx   ; Y
00000009  mov         rdx,rcx  ; X

; int XLength = X.Length;
0000000c  mov         r9,qword ptr [rdx+8]

; int XLengthMinus1 = XLength - 1;
00000010  movsxd      rax,r9d
00000013  lea         r11,[rax-1]

; int YLength = Y.Length;
00000017  mov         r10,qword ptr [r8+8]

; if (XLength != YLength)
;     throw new ArgumentException("X and Y must be same size");
0000001b  cmp         r9d,r10d
0000001e  jne         0000000000000060

; double sum = 0;
00000020  xorpd       xmm1,xmm1

; if (XLength > 0)
; {
00000024  test        r9d,r9d
00000027  jle         0000000000000054

;     int i = 0;
00000029  xor         ecx,ecx
0000002b  xor         eax,eax

;     if (XLengthMinus1 >= YLength)
;         throw new IndexOutOfRangeException();
0000002d  cmp         r11,r10
00000030  jae         0000000000000096

;     do
;     {
;         sum += X[i] * Y[i];
00000032  movsd       xmm0,mmword ptr [rdx+rax+10h]
00000038  mulsd       xmm0,mmword ptr [r8+rax+10h]
0000003f  addsd       xmm0,xmm1
00000043  movapd      xmm1,xmm0

;         i++;
00000047  inc         ecx
00000049  add         rax,8

;     }
;     while (i < XLength);
0000004f  cmp         ecx,r9d
00000052  jl          0000000000000032
; }

; return sum;
00000054  movapd      xmm0,xmm1

; (Epilogue)
00000058  add         rsp,28h
0000005c  pop         rdi
0000005d  pop         rbx
0000005e  ret

00000060  ...

00000096  ...

32-bit

The 32-bit jitter, unfortunately, is not quite as smart. In the disassembly below, notice that:

  • There are no bounds checks for X, even though your code compares i against length instead of X.Length. Again, this is an improvement over the behavior described in the article.
  • The main loop (offsets 00000018 through 0000002a) contains a bounds check for Y.

Disassembly

; Register assignments:
;    eax  := i
;    ecx  := X
;    edx  := Y
;    esi  := X.Length ("length" in your code, "XLength" below)

; (Prologue)
00000000  push        ebp
00000001  mov         ebp,esp
00000003  push        esi

; double sum = 0;
00000004  fldz

; int XLength = X.Length;
00000006  mov         esi,dword ptr [ecx+4]

; if (XLength != Y.Length)
;     throw new ArgumentException("X and Y must be same size");
00000009  cmp         dword ptr [edx+4],esi
0000000c  je          00000012
0000000e  fstp        st(0)
00000010  jmp         0000002F

; int i = 0;
00000012  xor         eax,eax

; if (XLength > 0)
; {
00000014  test        esi,esi
00000016  jle         0000002C

;     do
;     {
;         double temp = X[i];
00000018  fld         qword ptr [ecx+eax*8+8]

;         if (i >= Y.Length)
;             throw new IndexOutOfRangeException();
0000001c  cmp         eax,dword ptr [edx+4]
0000001f  jae         0000005A

;         sum += temp * Y[i];
00000021  fmul        qword ptr [edx+eax*8+8]
00000025  faddp       st(1),st

;         i++;
00000027  inc         eax

;     }
;     while (i < XLength);
00000028  cmp         eax,esi
0000002a  jl          00000018
; }

; return sum;
0000002c  pop         esi
0000002d  pop         ebp
0000002e  ret

0000002f  ...

0000005a  ...

Summing Up

The jitter has improved since 2009, and the 64-bit jitter can generate more efficient code than the 32-bit jitter.

If necessary, though, you can always bypass array bounds checks completely by using unsafe code and pointers (as svick points out). This technique is used by some performance-critical code in the Base Class Library.

Michael Liu
  • x64 isn't actually better, even though it knows to eliminate the bounds check. You get the check for free on x86 because execution of the check overlaps with the execution of the FPU instructions, a feature of modern super-scalar processor cores. – Hans Passant Jun 17 '13 at 00:41
  • @HansPassant: The check may be free in this particular case, but what about in general? – Michael Liu Jun 17 '13 at 02:43
  • In general, the jitters know how to eliminate bounds checks. Give or take. And when they slip up, it tends not to matter that much, since the check is so cheap, memory is so slow, and processors are so awesome :) – Hans Passant Jun 17 '13 at 08:06
  • Michael thanks a lot for the effort. @HansPassant I'm beginning to appreciate the awesomeness of modern processors! My impression is that what modern processors can't get in raw speed they are getting instead in terms of keeping the pipeline full, optimistic scheduling, etc. In general, is there a good source for reading up about the awesomeness of modern processors, or is this best appreciated by experience / participating in discussions like this? – TooTone Jun 18 '13 at 10:45
  • The Intel processor manuals are a resource, but not a great one. Google "Agner Fog"; he's the leading authority. – Hans Passant Jun 18 '13 at 11:13
  • Your disassembly is cool. I have not been able to figure out how to create the x64 assembler for x64 JIT version of a .NET assembly. How did you do it? – Mark R May 30 '21 at 22:24

One way to be sure that bounds checking is not performed is to use pointers, which you can do in C# in an unsafe context (this requires enabling "Allow unsafe code" in the project's build properties):

private static unsafe double SumProductPointer(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    fixed (double* xp = X, yp = Y)
    {
        for (int i = 0; i < length; i++)
            sum += xp[i] * yp[i];
    }
    return sum;
}

I tried measuring your original method, your method with the X.Length change, and my code using pointers, compiled as both x86 and x64 under .NET 4.5. Specifically, I ran each method 10,000 times on vectors of length 10,000.

The results are pretty much in line with Michael Liu's answer: there is no measurable difference between the three methods, which means that bounds checking either isn't performed or that its effect on performance is insignificant. There was a measurable difference between x86 and x64, though: x64 was about 34% slower.

Full code I used:

using System;
using System.Diagnostics;
using System.Linq;

static void Main()
{
    var random = new Random(42);
    double[] x = Enumerable.Range(0, 10000).Select(_ => random.NextDouble()).ToArray();
    double[] y = Enumerable.Range(0, 10000).Select(_ => random.NextDouble()).ToArray();

    // make sure JIT doesn't affect the results
    SumProduct(x, y);
    SumProductLength(x, y);
    SumProductPointer(x, y);

    var stopwatch = new Stopwatch();
    stopwatch.Start();
    for (int i = 0; i < 10000; i++)
    {
        SumProduct(x, y);
    }
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
    stopwatch.Restart();
    for (int i = 0; i < 10000; i++)
    {
        SumProductLength(x, y);
    }
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
    stopwatch.Restart();
    for (int i = 0; i < 10000; i++)
    {
        SumProductPointer(x, y);
    }
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

private static double SumProduct(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    for (int i = 0; i < length; i++)
        sum += X[i] * Y[i];
    return sum;
}

private static double SumProductLength(double[] X, double[] Y)
{
    double sum = 0;
    if (X.Length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    for (int i = 0; i < X.Length; i++)
        sum += X[i] * Y[i];
    return sum;
}

private static unsafe double SumProductPointer(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    fixed (double* xp = X, yp = Y)
    {
        for (int i = 0; i < length; i++)
            sum += xp[i] * yp[i];
    }
    return sum;
}
svick
  • Test code: very good idea! I ran your code on a regular PC (x86 Family 6 Model 30 Stepping 5 GenuineIntel ~2793 Mhz) and got 1200 / 1000 / 1300, approx, after increasing the number of iterations by a factor of 10, on .NET 4. On my modern laptop (Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz) I saw exactly the same as you: 970 / 970 / 970. For the laptop I tried .NET 4 as well as 4.5, and putting the code into a separate assembly (to prevent inlining), but the results were the same. I suspect that the more modern architecture of the laptop is helping. – TooTone Jun 17 '13 at 18:31
  • I reduced the array sizes to 100 and increased the iteration counts to 10000000. With a Core i7-2600 CPU @ 3.40GHz, the results for x86 and x64 are 844/839/920 and 1074/1074/1153 respectively (median of 9 runs). `SumProduct` and `SumProductLength` are co-winners (their native code is almost, but not entirely, identical), while `SumProductPointer` is the loser. – Michael Liu Jun 18 '13 at 03:00
  • @MichaelLiu I just ran my code on a Core i5-2500 @ 3.30GHz, and saw no appreciable difference: all three were approx 1100ms (over several runs). I think it is interesting that on systems where we did see a difference, the native code performs worst. There are some related comments on the question I wrote on [simulating bounds checking in C++](http://stackoverflow.com/questions/17070607/why-does-adding-extra-check-in-loop-make-big-difference-on-some-machines-and-sm). – TooTone Jun 18 '13 at 10:34

First of all, I would like to thank everyone who spoke up in this post, from the original OP to the people who provided extremely detailed and insightful explanations. I really, really enjoyed reading the existing answers. Since there is already plenty of theory about how and why these loops work the way they do, I would like to offer some empirical measurements:

Conclusions:

  • A foreach loop is faster than a for loop.
  • Caching the length in a local variable is faster than reading the array's .Length property in the loop condition.
  • GC-pinning with unsafe fixed is not faster than a normal for loop.

Benchmarking code:

using System;
using System.Diagnostics;
using System.Runtime;

namespace demo
{
    class MainClass
    {
        static bool ByForArrayLength (byte[] data)
        {
            for (int i = 0; i < data.Length; i++)
                if (data [i] != 0)
                    return false;
            return true;
        }

        static bool ByForLocalLength (byte[] data)
        {
            int len = data.Length;
            for (int i = 0; i < len; i++)
                if (data [i] != 0)
                    return false;
            return true;
        }

        static unsafe bool ByForUnsafe (byte[] data)
        {
            fixed (byte* datap = data)
            {
                int len = data.Length;
                for (int i = 0; i < len; i++)
                    if (datap [i] != 0)
                        return false;
                return true;
            }
        }

        static bool ByForeach (byte[] data)
        {
            foreach (byte b in data)
                if (b != 0)
                    return false;
            return true;
        }

        static void Measure (Action work, string description)
        {
            GCSettings.LatencyMode = GCLatencyMode.LowLatency;
            var watch = Stopwatch.StartNew ();
            work.Invoke ();
            Console.WriteLine ("{0,-40}: {1} ms", description, watch.Elapsed.TotalMilliseconds);
        }

        public static void Main (string[] args)
        {
            byte[] data = new byte[256 * 1024 * 1024];
            Measure (() => ByForArrayLength (data), "For with .Length property");
            Measure (() => ByForLocalLength (data), "For with local variable");
            Measure (() => ByForUnsafe (data), "For with local variable and GC-pinning");
            Measure (() => ByForeach (data), "Foreach loop");
        }
    }
}

Results (Mono runtime):

$ mcs Program.cs -optimize -unsafe
For with .Length property               : 440,9208 ms
For with local variable                 : 333,2252 ms
For with local variable and GC-pinning  : 330,2205 ms
Foreach loop                            : 280,5205 ms
ArekBulski