
I have three cases to test the relative performance of classes, classes with inheritance, and structs. These are to be used in tight loops, so performance counts. Dot products are used as part of many algorithms in 2D and 3D geometry, and I have run the profiler on real code. The tests below are indicative of real-world performance problems I have seen.

The results for 100000000 passes through the loop, applying the dot product each time, are:

ControlA 208 ms   (class with inheritance)
ControlB 201 ms   (class with no inheritance)
ControlC  85 ms   (struct)

The tests were run without debugging and with optimization turned on. My question is: what is it about classes in this case that causes them to be so slow?

I presumed the JIT would still be able to inline all the calls, class or struct, so the results should in effect be identical. Note that if I disable optimizations, my results are identical:

ControlA 3239
ControlB 3228
ControlC 3213

They are always within 20 ms of each other if the test is re-run.

The classes under investigation:

using System;
using System.Diagnostics;

public class PointControlA
{
    public double X
    {
        get;
        set;
    }

    public double Y
    {
        get;
        set;
    }

    public PointControlA(double x, double y)
    {
        X = x;
        Y = y;
    }
}

public class Point3ControlA : PointControlA
{
    public double Z
    {
        get;
        set;
    }

    public Point3ControlA(double x, double y, double z) : base(x, y)
    {
        Z = z;
    }

    public static double Dot(Point3ControlA a, Point3ControlA b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

public class Point3ControlB
{
    public double X
    {
        get;
        set;
    }

    public double Y
    {
        get;
        set;
    }

    public double Z
    {
        get;
        set;
    }

    public Point3ControlB(double x, double y, double z)
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlB a, Point3ControlB b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

public struct Point3ControlC
{
    public double X
    {
        get;
        set;
    }

    public double Y
    {
        get;
        set;
    }

    public double Z
    {
        get;
        set;
    }

    public Point3ControlC(double x, double y, double z) : this()
    {
        X = x;
        Y = y;
        Z = z;
    }

    public static double Dot(Point3ControlC a, Point3ControlC b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
    }
}

Test Script

public class Program
{
    public static void TestStructClass()
    {
        var vControlA = new Point3ControlA(11, 12, 13);
        var vControlB = new Point3ControlB(11, 12, 13);
        var vControlC = new Point3ControlC(11, 12, 13);
        var sw = Stopwatch.StartNew();
        var n = 10000000;
        double acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlA.Dot(vControlA, vControlA);
        }

        Console.WriteLine("ControlA " + sw.ElapsedMilliseconds);
        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlB.Dot(vControlB, vControlB);
        }

        Console.WriteLine("ControlB " + sw.ElapsedMilliseconds);
        acc = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            acc += Point3ControlC.Dot(vControlC, vControlC);
        }

        Console.WriteLine("ControlC " + sw.ElapsedMilliseconds);
    }

    public static void Main()
    {
        TestStructClass();
    }
}

This dotnet fiddle is proof of compilation only. It does not show the performance differences.

I am trying to explain to a vendor why their choice to use classes instead of structs for small numeric types is a bad idea. I now have a test case that proves it, but I can't understand why.

NOTE: I have tried to set a breakpoint in the debugger with JIT optimizations turned on, but the debugger will not break. Looking at the IL with JIT optimizations turned off doesn't tell me anything.

EDIT

After the answer by @pkuderov I took his code and played with it. I found that if I forced inlining via

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static double Dot(Point3Class a)
{
    return a.X * a.X + a.Y * a.Y + a.Z * a.Z;
}

the difference between the struct and the class for the dot product vanished. Why the attribute is not needed with some setups but was for me is not clear. However, I did not give up: there is still a performance problem with the vendor code, and I think the dot product is not the best example.

I modified @pkuderov's code to implement vector add, which creates new instances of the structs and classes. The results are here:

https://gist.github.com/bradphelan/9b383c8e99edc38068fcc0dccc8a7b48

In the example I also modified the code to pick a pseudo-random vector from an array, to avoid the problem of the instances staying in registers (I hope).

The results show that:

DotProduct performance is identical, or maybe faster, for classes.
Vector add, and I assume anything else that creates a new object, is slower for classes.

Add     class/class 2777 ms   struct/struct 2457 ms

DotProd class/class 1909 ms   struct/struct 2108 ms

The full code and results are here if anybody wants to try it out.

Edit Again

For the vector add example, where an array of vectors is summed together, the struct version keeps the accumulator in three registers:

var accStruct = new Point3Struct(0, 0, 0);
for (int i = 0; i < n; i++)
    accStruct = Point3Struct.Add(accStruct, pointStruct[(i + 1) % m]);
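(Point3Struct.Add itself is not shown above; a minimal sketch of what it presumably looks like, with the field names and layout being my assumption:)

```csharp
using System;

public struct Point3Struct
{
    public double X, Y, Z;

    public Point3Struct(double x, double y, double z)
    {
        X = x; Y = y; Z = z;
    }

    // Component-wise addition returning a new value; for a struct the
    // whole result can live in registers, as the asm below shows.
    public static Point3Struct Add(Point3Struct a, Point3Struct b)
        => new Point3Struct(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
}
```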

The asm body is:

// load the next vector into a register
00007FFA3CA2240E  vmovsd      xmm3,qword ptr [rax]  
00007FFA3CA22413  vmovsd      xmm4,qword ptr [rax+8]  
00007FFA3CA22419  vmovsd      xmm5,qword ptr [rax+10h]  
// Add to the accumulator (the accumulator stays in registers)
00007FFA3CA2241F  vaddsd      xmm0,xmm0,xmm3  
00007FFA3CA22424  vaddsd      xmm1,xmm1,xmm4  
00007FFA3CA22429  vaddsd      xmm2,xmm2,xmm5  

but the class-based vector version reads and writes the accumulator to main memory each time, which is inefficient:

var accPC = new Point3Class(0, 0, 0);
for (int i = 0; i < n; i++)
    accPC = Point3Class.Add(accPC, pointClass[(i + 1) % m]);

The asm body is:

// Read and add both accumulator X and Xnext from main memory
00007FFA3CA2224A  vmovsd      xmm0,qword ptr [r14+8]     
00007FFA3CA22250  vmovaps     xmm7,xmm0                   
00007FFA3CA22255  vaddsd      xmm7,xmm7,mmword ptr [r12+8]  


// Read and add both accumulator Y and Ynext from main memory
00007FFA3CA2225C  vmovsd      xmm0,qword ptr [r14+10h]  
00007FFA3CA22262  vmovaps     xmm8,xmm0  
00007FFA3CA22267  vaddsd      xmm8,xmm8,mmword ptr [r12+10h] 

// Read and add both accumulator Z and Znext from main memory
00007FFA3CA2226E  vmovsd      xmm9,qword ptr [r14+18h]  
00007FFA3CA22283  vmovaps     xmm0,xmm9  
00007FFA3CA22288  vaddsd      xmm0,xmm0,mmword ptr [r12+18h]

// Move accumulator X, Y, Z back to main memory.
00007FFA3CA2228F  vmovsd      qword ptr [rax+8],xmm7  
00007FFA3CA22295  vmovsd      qword ptr [rax+10h],xmm8  
00007FFA3CA2229B  vmovsd      qword ptr [rax+18h],xmm0  
bradgonesurfing

  • So we're talking about a few milliseconds after 100 million iterations? – Tim Schmelter Jul 06 '17 at 12:56
  • I don't see you warm up the classes/methods. – Patrick Hofman Jul 06 '17 at 12:57
  • see https://curryncode.com/2016/09/28/coding-for-performance-struct-vs-class/ and https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/choosing-between-class-and-struct. however...do you know that this small difference will really impact code performance? network access, db access, filesystem access will all cause your program to be slower than a class vs struct. Is this a case of premature optimization? – Jeremy Jul 06 '17 at 12:58
  • What happens if you switch the code around? First C, then B and then A? Does it matter for the performance? – Patrick Hofman Jul 06 '17 at 12:58
  • See also: [Choosing Between Class and Struct](https://learn.microsoft.com/dotnet/standard/design-guidelines/choosing-between-class-and-struct) -- Also, beware of [benchmarking mistakes](https://ericlippert.com/2013/05/14/benchmarking-mistakes-part-one/); you might want to take a look at [BenchmarkDotNet](https://www.nuget.org/packages/BenchmarkDotNet/). – Corak Jul 06 '17 at 12:59
  • The test is not very relevant for real cases. The usual culprits are memory access (TLB, L2/L3 misses) and branch misprediction, none of which is viably tested by repeatedly calling the same object with the same values. – Remus Rusanu Jul 06 '17 at 13:00
  • Switching the scripts around shows Case C is still almost 3 times faster – bradgonesurfing Jul 06 '17 at 13:00
  • `I am trying to explain to a vendor why their choice to use classes instead of structs for small numeric types is a bad idea` unless you actually **measured** the performance and **found** that the use of a class is a bottleneck, you really have no argument, *even with* your findings. – Remus Rusanu Jul 06 '17 at 13:04
  • 100ms difference would only really matter if you were to run that about 100/1000 times more than you actually do. But I highly doubt your application will be doing that... – Camilo Terevinto Jul 06 '17 at 13:06
  • It makes a difference if I want real-time performance and user feedback on parameter changes that affect the interaction of large triangle meshes. We aren't scraping Twitter feeds here :) – bradgonesurfing Jul 06 '17 at 13:13
  • @bradgonesurfing I highly doubt you will be passing the exact same parameters then. Why don't you create a more realistic test case (say some hundreds of different parameters)? – Camilo Terevinto Jul 06 '17 at 13:16
  • While I disagree with most of the naive criticisms of the performance-testing methodology mentioned here, @RemusRusanu is dead right: you cannot measure anything of importance by calling the same object in a tight loop. You need to alter this test so that it uses different objects. – RBarryYoung Jul 06 '17 at 13:18
  • the code won't even compile for me.. – David Haim Jul 06 '17 at 13:27
  • @CamiloTerevinto I modified the test to change the value of the inputs each iteration. The performance discrepancy still stands, but I now have the overhead of construction as well. – bradgonesurfing Jul 06 '17 at 13:28
  • @DavidHaim Probably on "_Output". Replace that with Console.WriteLine – bradgonesurfing Jul 06 '17 at 13:28
  • @bradgonesurfing it complains about the X, Y and Z assignments. Please post complete code which actually compiles. – David Haim Jul 06 '17 at 13:32
  • So I tried this under 4 different configurations: 32-bit non-optimized, 32-bit optimized, 64-bit non-optimized, 64-bit optimized. In each and every case, on my machine, the struct performed the same as the class. Except in the case of 32-bit optimized where the struct performed 4 times *worse*. I even created a few more variants to test and results were similar. I think you have no worries. Design for correctness first. If performance is an issue and this is the bottleneck, then measure and fix. Not before. – Jesse C. Slicer Jul 06 '17 at 13:53
  • I voted to close this question as the code doesn't compile, and therefore the question is un-answerable. If, theoretically, we could have compiled the code, we would be able to see the generated assembly and actually give a concrete answer not based on guesses. – David Haim Jul 06 '17 at 15:16
  • @DavidHaim I think you were using an old compiler. C# 7 compiler handles it correctly. Anyway I've backported the code https://dotnetfiddle.net/UBk2WC. And looking at the assembly does not always help. I have done this but it tells you nothing about what the JIT actually does and what ends up actually running. – bradgonesurfing Jul 07 '17 at 05:25
  • What happens with `[MethodImpl(MethodImplOptions.AggressiveInlining)]` outside the containing assembly? Will it be inlined on the fly even there? – abenci Jul 07 '17 at 07:26
  • the JIT can inline across assembly boundaries. – bradgonesurfing Jul 07 '17 at 07:29

2 Answers


Update

After spending some time thinking about the problem, I think I agree with @DavidHaim that memory-jump overhead is not the issue here, because of caching.

I've also added more options to your tests (and removed the first one, with inheritance). So I have:

  • cl = variable of a class with 3 coordinates:
    • Dot(cl, cl) - initial method
    • Dot(cl) - which is "square product"
    • Dot(cl.X, cl.Y, cl.Z, cl.X, cl.Y, cl.Z) aka Dot(cl.xyz) - pass fields
  • st = variable of a struct with 3 coordinates:
    • Dot(st, st) - initial
    • Dot(st) - square product
    • Dot(st.X, st.Y, st.Z, st.X, st.Y, st.Z) aka Dot(st.xyz) - pass fields
  • st6 = variable of a struct with 6 coordinates:
    • Dot(st6) - wanted to check whether the size of the struct matters
  • Dot(x, y, z, x, y, z) aka Dot(xyz) - just local const double variables.

The resulting times are:

  • Dot(cl.xyz) is the worst, ~570 ms
  • Dot(st6) and Dot(st.xyz) are the second worst, ~440 ms and ~480 ms
  • the others are ~325 ms

...And I'm not really sure why I see these results.

Maybe for plain primitive types the compiler does more aggressive pass-by-register optimization; maybe it is more sure of lifetime boundaries or constness, enabling more aggressive optimization again. Maybe some kind of loop unrolling.

I think my expertise is just not enough :) But still, my results counter your results.

The full test code, the results on my machine, and the generated IL code can be found here.


In C# classes are reference types and structs are value types. One major effect is that value types can be (and most of the time are!) allocated on the stack, while reference types are always allocated on the heap.

So every time you access the inner state of a reference-type variable you need to dereference a pointer into heap memory (a kind of jump), while a value type is already on the stack, or even optimized out into registers.

I think you see a difference because of this.

P.S. By "most of the time are" I meant boxing: a technique used to place value-type objects on the heap (e.g. to cast a value type to an interface, or for dynamic method-call binding).
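To illustrate boxing, here is a minimal sketch (the `IHasX`/`PointS` names are mine, invented for the example): assigning a struct to an interface-typed variable copies it into a new heap object, so the two copies diverge.

```csharp
using System;

interface IHasX { double X { get; } }

struct PointS : IHasX
{
    public double X { get; set; }
}

public class BoxingDemo
{
    public static void Main()
    {
        var p = new PointS { X = 1.0 };
        IHasX boxed = p;   // boxing: p is copied into a heap object
        p.X = 2.0;         // mutates only the local (stack) copy
        Console.WriteLine(boxed.X); // the boxed copy still holds 1
    }
}
```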

pkuderov
  • this is wrong. You access the stack through a pointer as well, and both can be cached in registers. There is no performance difference in a tight loop between something like `mov rax, [rsp + 10h]` and `mov rax, [rcx + 10h]`. – David Haim Jul 06 '17 at 13:57
  • @DavidHaim yeah, I agree with you. Need to edit my answer – pkuderov Jul 06 '17 at 14:15
  • Very strange, because when I run your example as a .NET Core 1.1 console app in release mode on 64-bit Windows 10 (Dell Precision M6800) I get https://gist.github.com/bradphelan/fec409ea979f97151e238ab7acf72a25 ``dot(st,st)`` is always the fastest at about 83 ms. But I don't get why ``dot(st)`` is consistently twice as slow, as it should be identical to ``dot(st,st)``. Nice extension to my test suite. – bradgonesurfing Jul 07 '17 at 05:39
  • The solution is to set the [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute on every ``Dot`` method. Then the results are fast and identical for every example, even in my original test case. Here is your updated code and results: https://gist.github.com/bradphelan/7fe5a7d480953b9bb35a40c179884270 – bradgonesurfing Jul 07 '17 at 05:51
  • @pkuderov I updated/extended your code and also picked pseudo-random vectors from an array. The results show that for dot product with forced inlining via the attribute, the class-based version is faster, but... I also implemented vector add and performed the same test; that is slower for classes than for structs. In summary, I think the fix for the vendor problem is just to add the aggressive-inlining attribute to all their small methods, as the JIT is just not inlining them (at least for me). New code is at https://gist.github.com/bradphelan/9b383c8e99edc38068fcc0dccc8a7b48 – bradgonesurfing Jul 07 '17 at 07:05
  • I wanted to add random picking too but in the end just forgot about that idea! Nice extension to my extension :) Glad to see you figured out the cause. Btw, I've never needed to do such low-level optimizations in the .NET world - most of the time the problems lie at a much higher level. – pkuderov Jul 07 '17 at 07:25

As I thought, this test doesn't prove much.

TL;DR: the compiler completely optimizes away the call to Point3ControlC.Dot while preserving the calls to the other two. The difference is not because structs are faster in this case, but because you skip the entire calculation.

My settings:

  • Visual studio 2015 update 3
  • .Net framework version 4.6.1
  • Release mode, Any CPU (my CPU is 64 bit)
  • Windows 10
  • CPU: Processor Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz, 2295 Mhz, 2 Core(s), 4 Logical Processor(s)

The generated assembly for

for (int i = 0; i < n; i++)
{
    acc += Point3ControlA.Dot(vControlA, vControlA);
}

is:

00DC0573  xor         edx,edx                     // temp = 0
00DC0575  mov         dword ptr [ebp-10h],edx     // i = temp
00DC0578  mov         ecx,edi                     // load vControlA as the first argument
00DC057A  mov         edx,edi                     // load vControlA as the second argument
00DC057C  call        dword ptr ds:[0BA4F0Ch]     // call Point3ControlA.Dot
00DC0582  fstp        st(0)                       // pop (discard) the result
00DC0584  inc         dword ptr [ebp-10h]         // i++
00DC0587  cmp         dword ptr [ebp-10h],989680h // compare i with n
00DC058E  jl          00DC0578                    // if i < n, jump to the beginning of the loop

Afterthoughts: for some reason the JIT compiler did not use a register for i, so it incremented an integer on the stack (ebp-10h) instead. As a result, this test has the poorest performance.

Moving on to the second test:

for (int i = 0; i < n; i++)
{
    acc += Point3ControlB.Dot(vControlB, vControlB);
}

Generated assembly:

00DC0612  xor         edi,edi                 // i = 0
00DC0614  mov         ecx,esi                 // load vControlB as the first argument
00DC0616  mov         edx,esi                 // load vControlB as the second argument
00DC0618  call        dword ptr ds:[0BA4FD4h] // call Point3ControlB.Dot
00DC061E  fstp        st(0)                   // pop (discard) the result
00DC0620  inc         edi                     // ++i
00DC0621  cmp         edi,989680h             // compare i with n
00DC0627  jl          00DC0614                // if i < n, jump to the beginning of the loop

Afterthoughts: this generated assembly is almost identical to the first, but this time the JIT did use a register for i, hence the minor performance boost over the first test.

Moving on to the test in question:

for (int i = 0; i < n; i++)
{
    acc += Point3ControlC.Dot(vControlC, vControlC);
}

And for the generated assembly:

00DC06A7  xor         eax,eax     // i = 0
00DC06A9  inc         eax         // ++i
00DC06AA  cmp         eax,989680h // compare i with n
00DC06AF  jl          00DC06A9    // if i < n, jump to the beginning of the loop

As we can see, the JIT has completely optimized away the call to Point3ControlC.Dot, so you actually only pay for the loop, not for the call itself. Hence this "test" finishes first, as it didn't do much to begin with.

Can we say anything about structs vs. classes from this test alone? Well, no. I'm still not quite sure why the compiler decided to optimize away the call to the struct function while preserving the other calls. What I am sure about is that in real-life code the compiler cannot optimize the call away if the result is used. In this mini-benchmark we don't do much with the result, and even if we did, the compiler could calculate the result at compile time, so it can be more aggressive here than it could be in real-life code.
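A sketch of one way to guard a micro-benchmark against this, reusing `Point3ControlC` from the question: make the loop's result observable after timing, so the JIT cannot remove the computation as dead code.

```csharp
using System;
using System.Diagnostics;

public class GuardedBenchmark
{
    public static void Main()
    {
        const int n = 10000000;
        var v = new Point3ControlC(11, 12, 13);
        double acc = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
            acc += Point3ControlC.Dot(v, v);
        sw.Stop();
        // Printing acc makes the result observable, so the call (or at
        // least the arithmetic) cannot be eliminated entirely. Note the
        // JIT may still constant-fold Dot(v, v), since v never changes.
        Console.WriteLine($"ControlC {sw.ElapsedMilliseconds} ms, acc = {acc}");
    }
}
```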

David Haim
  • When you say JIT do you mean IL output from the compiler or the IL output after loading the DLL into the running application. I was not able to see the true JIT'd code as I couldn't get a breakpoint to stop with optimised debug code. Im on a mobile so I can't really verify your answer at the moment till Monday. – bradgonesurfing Jul 08 '17 at 12:50
  • Also, the result was written to the console for each test. How could the compiler/JIT remove the calculation yet still get the result correct? Unless it did the dot product and sum at JIT time... very cool if that is so. – bradgonesurfing Jul 08 '17 at 12:52
  • I mean the generated machine code, not IL. You can do that in VS. The code you linked (which I used) does not use the result. And anyway, calculating stuff at compile time is not impressive; compilers have done it since the '90s. – David Haim Jul 08 '17 at 12:54
  • How did you see the generated machine code during debugging? You are aware that under normal debugging the JIT optimizer is turned off. If you turn it on, VS warns you and then generally refuses to stop at breakpoints. Anyway, if you read my edits to the question, the interesting questions have been answered by using a modified version of another answer which does use the result, so it can't be optimized away. – bradgonesurfing Jul 08 '17 at 18:31
  • But you are right I don't use the result of acc in the posted code. I might have made a mistake copying it. However the newer code posted at the bottom via link to gist does use it. – bradgonesurfing Jul 08 '17 at 18:39
  • debugging a release build is not the same as debugging a debug build; the code is still optimized. – David Haim Jul 08 '17 at 18:39
  • There is an option under Debug options that disables the JIT optimizer during debugging. It is normally set. If you unset it VS warns you that your debugging experience will be suboptimal. I myself only discovered this this week. – bradgonesurfing Jul 08 '17 at 18:43
  • https://stackoverflow.com/questions/12243410/what-is-the-effect-of-suppress-jit-optimization-on-module-load-debugging-optio – bradgonesurfing Jul 08 '17 at 18:46
  • it **is** the generated assembly for the optimized code. – David Haim Jul 09 '17 at 08:12
  • So you keep saying, and I'm not saying you are wrong, but I am asking how you got access to it, and did you ensure the setting that I pointed you to was correctly set? I am just trying to understand so I can replicate it myself. – bradgonesurfing Jul 09 '17 at 11:07
  • This explains how to see the optimized code for a release build. I was missing turning on the generation of a PDB for the release build. Without that you can't set a breakpoint. https://blogs.msdn.microsoft.com/clrcodegeneration/2007/10/19/how-to-see-the-assembly-code-generated-by-the-jit-using-visual-studio/ – bradgonesurfing Jul 09 '17 at 11:20
  • build the project in release mode, put a breakpoint, right click -> Go To Disassembly, and voila. Whoever says this is not optimized assembly has probably never dealt with assembly in his life – David Haim Jul 09 '17 at 11:38
  • Nobody is disagreeing with you David. We are just trying to verify the process as it stands with regards to all the options in VS. – bradgonesurfing Jul 09 '17 at 11:46
  • You have to deselect "Enable just my code" in the debug options. When the JIT optimizer is enabled and "enable just my code is selected" the debugger refuses to set breakpoints. It's a feature, not a bug according to MS. I can now see the optimized generated assembly. – bradgonesurfing Jul 10 '17 at 06:23