
I was trying to measure the performance difference between using a for loop and a foreach loop when accessing lists of value types and reference types.

I used the following class to do the profiling.

using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class Benchmarker
{
    public static void Profile(string description, int iterations, Action func)
    {
        Console.Write(description);

        // Warm up
        func();

        Stopwatch watch = new Stopwatch();

        // Clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++)
        {
            func();
        }
        watch.Stop();

        Console.WriteLine(" average time: {0} ms", watch.Elapsed.TotalMilliseconds / iterations);
    }
}

I used double for my value type, and I created this 'fake class' to test reference types:

class DoubleWrapper
{
    public double Value { get; set; }

    public DoubleWrapper(double value)
    {
        Value = value;
    }
}

Finally, I ran this code and compared the time differences.

static void Main(string[] args)
{
    int size = 1000000;
    int iterationCount = 100;

    var valueList = new List<double>(size);
    for (int i = 0; i < size; i++) 
        valueList.Add(i);

    var refList = new List<DoubleWrapper>(size);
    for (int i = 0; i < size; i++) 
        refList.Add(new DoubleWrapper(i));

    double dummy;

    Benchmarker.Profile("valueList for: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < valueList.Count; i++)
        {
             unchecked
             {
                 var temp = valueList[i];
                 result *= temp;
                 result += temp;
                 result /= temp;
                 result -= temp;
             }
        }
        dummy = result;
    });

    Benchmarker.Profile("valueList foreach: ", iterationCount, () =>
    {
        double result = 0;
        foreach (var v in valueList)
        {
            var temp = v;
            result *= temp;
            result += temp;
            result /= temp;
            result -= temp;
        }
        dummy = result;
    });

    Benchmarker.Profile("refList for: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < refList.Count; i++)
        {
            unchecked
            {
                var temp = refList[i].Value;
                result *= temp;
                result += temp;
                result /= temp;
                result -= temp;
            }
        }
        dummy = result;
    });

    Benchmarker.Profile("refList foreach: ", iterationCount, () =>
    {
        double result = 0;
        foreach (var v in refList)
        {
            unchecked
            {
                var temp = v.Value;
                result *= temp;
                result += temp;
                result /= temp;
                result -= temp;
            }
        }

        dummy = result;
    });

    SafeExit();
}

I selected the Release and Any CPU options, ran the program, and got the following times:

valueList for:  average time: 483,967938 ms
valueList foreach:  average time: 477,873079 ms
refList for:  average time: 490,524197 ms
refList foreach:  average time: 485,659557 ms
Done!

Then I selected the Release and x64 options, ran the program, and got the following times:

valueList for:  average time: 16,720209 ms
valueList foreach:  average time: 15,953483 ms
refList for:  average time: 19,381077 ms
refList foreach:  average time: 18,636781 ms
Done!

Why is the x64 version so much faster? I expected some difference, but not one this big.

I do not have access to other computers. Could you please run this on your machines and tell me the results? I'm using Visual Studio 2015 and I have an Intel Core i7 930.

Here's the SafeExit() method, so you can compile/run by yourself:

private static void SafeExit()
{
    Console.WriteLine("Done!");
    Console.ReadLine();
    System.Environment.Exit(1);
}

As requested, using double? instead of my DoubleWrapper:

Any CPU

valueList for:  average time: 482,98116 ms
valueList foreach:  average time: 478,837701 ms
refList for:  average time: 491,075915 ms
refList foreach:  average time: 483,206072 ms
Done!

x64

valueList for:  average time: 16,393947 ms
valueList foreach:  average time: 15,87007 ms
refList for:  average time: 18,267736 ms
refList foreach:  average time: 16,496038 ms
Done!

Last but not least: creating an x86 profile gives me almost the same results as using Any CPU.

Trauer
  • What type of CPU are you using? AMD or Intel? – Matthew Layton Aug 07 '15 at 10:29
  • Intel i7 930 <3 more characters> – Trauer Aug 07 '15 at 10:30
  • 14
    "Any CPU" != "32Bits"! If compiled "Any CPU" your application should run as a 64bit process on your 64bit system. Also I'd remove the code messing with the GC. It doesn't actually help. – Thorsten Dittmar Aug 07 '15 at 10:32
  • 9
    @ThorstenDittmar the GC calls are prior to the measurement, rather than in the code measured. That's a reasonable enough thing to do to reduce the degree to which luck of GC timing can affect such a measurement. Also, there's "favour 32-bit" vs "favour 64-bit" as a factor between the builds. – Jon Hanna Aug 07 '15 at 10:39
  • 1
    @ThorstenDittmar But I run the release version (outside of Visual Studio) and the Task Manager says it is a 32 bit application (when compiled to Any CPU). Also. As Jon Hanna said, the GC call is useful. – Trauer Aug 07 '15 at 10:40
  • 2
    Which runtime version are you using? The new RyuJIT in 4.6 is a *lot* faster, but even for earlier versions, the x64 compiler and JITer were newer and more advanced than the x32 versions. They are able to perform far more aggressive optimizations than the x86 versions. – Panagiotis Kanavos Aug 07 '15 at 10:53
  • 2
    I'd note that the type involved seems to have no effect; change `double` to `float`, `long` or `int` and you get similar results. – Jon Hanna Aug 07 '15 at 10:54
  • @JonHanna I don't think it does. a) it doesn't run GC right away, b) there's nothing to GC in this application at this point. – Thorsten Dittmar Aug 07 '15 at 12:09
  • It might be interesting to test a loop body that is throughput-limited, rather than just being one single loop-carried dependency chain. As it is, all of the work depends on previous results; there's nothing for the CPU to do in parallel (other than bounds-check the next array load while the mul/div chain is running). You might see more difference between the methods if the "real work" occupied more of the CPUs execution resources. Also, on pre-Sandybridge Intel, there's a big difference between a loop fitting in the 28uop loop buffer or not. (Instruction decode bottlenecks if not.) – Peter Cordes Aug 07 '15 at 13:24
  • @panagiotis Since when does RytJIT produce faster code than the old x64 JIT? I only benchmarked preview versions, which were significantly slower. Did they improve it that much? – CodesInChaos Aug 07 '15 at 22:18
  • That is exactly why you should post "the shortest code necessary to reprodude the problem" on [so]. Jumbling addition, substraction, multiplication and division into one test is just obfuscating the potential issue. – Anthon Aug 08 '15 at 09:12
  • 1
    By default, visual studio AnyCPU configuration has "Prefer 32-bit" enabled, which means, your process will run as 32-bit on every platform. You can disable the "Prefer 32-bit" option, if you want to see your process run as 64-bit with AnyCPU configuration. – Erti-Chris Eelmaa Aug 08 '15 at 12:50

4 Answers


I can reproduce this on .NET 4.5.2. No RyuJIT here. Both x86 and x64 disassemblies look reasonable. Range checks and so on are the same. The same basic structure. No loop unrolling.

x86 uses a different set of float instructions. The performance of these instructions seems to be comparable with the x64 instructions except for the division:

  1. The 32 bit x87 float instructions use 10 byte (80 bit) precision internally.
  2. Extended precision division is super slow.

The division operation makes the 32 bit version extremely slow. Commenting out the division equalizes performance to a large degree (32 bit down from 430 ms to 3.25 ms).
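
That is, the experiment just described runs the question's inner loop with the division removed (a sketch):

for (int i = 0; i < valueList.Count; i++)
{
    unchecked
    {
        var temp = valueList[i];
        result *= temp;
        result += temp;
        // result /= temp;   // commented out: the 32 bit slowdown disappears
        result -= temp;
    }
}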

Peter Cordes points out that the instruction latencies of the two floating point units are not that dissimilar. Maybe some of the intermediate results are denormalized numbers or NaN. These might trigger a slow path in one of the units. Or, maybe the values diverge between the two implementations because of 10 byte vs. 8 byte float precision.

Peter Cordes also points out that all intermediate results are NaN... Removing this problem (valueList.Add(i + 1) so that no divisor is zero) mostly equalizes the results. Apparently, the 32 bit code does not like NaN operands at all. Let's print some intermediate values with if (i % 1000 == 0) Console.WriteLine(result); to confirm that the data is now sane.
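
Put together, the fixed setup and the sanity check look like this (a minimal sketch against the question's code):

int size = 1000000;
var valueList = new List<double>(size);
for (int i = 0; i < size; i++)
    valueList.Add(i + 1);               // no element is 0.0, so no 0.0 / 0.0 NaN

double result = 0;
for (int i = 0; i < valueList.Count; i++)
{
    var temp = valueList[i];
    result *= temp;
    result += temp;
    result /= temp;
    result -= temp;
    if (i % 1000 == 0)
        Console.WriteLine(result);      // sanity check: should never print NaN
}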

When benchmarking, you need to benchmark a realistic workload. But who would have thought that an innocent division could mess up your benchmark?!

Try simply summing the numbers to get a better benchmark.
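
For instance, a minimal sum-only variant, reusing the question's Benchmarker, valueList, iterationCount and dummy (a sketch):

Benchmarker.Profile("valueList sum: ", iterationCount, () =>
{
    double result = 0;
    for (int i = 0; i < valueList.Count; i++)
        result += valueList[i];
    dummy = result;
});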

Division and modulo are always very slow. If you modify the BCL Dictionary code to simply not use the modulo operator to compute the bucket index, performance measurably improves. That is how slow division is.
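
A hypothetical sketch of the idea (not the actual BCL code): force the bucket count to a power of two, and the % becomes a cheap bit mask.

static class BucketIndex
{
    // Hypothetical sketch: with a power-of-two bucket count,
    // the expensive % turns into a single AND.
    const int BucketCount = 1024;                   // must be a power of two

    public static int ByModulo(int hashCode)
        => (hashCode & 0x7FFFFFFF) % BucketCount;   // clear sign bit, then slow %

    public static int ByMask(int hashCode)
        => hashCode & (BucketCount - 1);            // cheap bit mask, no divider
}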

Here's the 32 bit code:

[Screenshot of the 32-bit x87 disassembly]

64 bit code (same structure, fast division):

[Screenshot of the 64-bit SSE disassembly]

This is not vectorized despite SSE instructions being used.

usr
  • IL usually is meaningless when debugging such low level performance issues. Hard to predict what the JIT can and cannot do. The compiled x86 code is the truth. – usr Aug 07 '15 at 11:53
  • 11
    "Who would have thought that an innocent division can mess up your benchmark?" I did, right away as soon as I saw a division in the inner loop, esp. as part of the dependency chain. Division is only innocent when it's integer division by a power of 2. From http://agner.org/optimize/ insn tables: Nehalem `fdiv` is 7-27 cycles latency (and same reciprocal throughput). `divsd` is 7-22 cycles. `addsd` at 3c latency, 1/c throughput. Division is the only non-pipelined execution unit in Intel/AMD CPUs. C# JIT isn't vectorizing the loop for x86-64 (with `divPd`). – Peter Cordes Aug 07 '15 at 13:02
  • 1
    Also, is it normal for 32b C# not to use SSE math? Isn't being able to use the features of the current machine part of the point of JIT? So on Haswell and later, it could auto-vectorize integer loops with 256b AVX2, instead of just SSE. To get vectorization of FP loops, I guess you'd have to write them with stuff like 4 accumulators in parallel, since FP math isn't associative. But anyway, using SSE in 32bit mode is faster, because you have fewer instructions to do the same scalar work when you don't have to juggle the x87 FP stack. – Peter Cordes Aug 07 '15 at 13:02
  • 4
    Anyway, div is very slow, but 10B x87 fdiv isn't much slower than 8B SSE2, so this this doesn't explain the difference between x86 and x86-64. What could explain it is FPU exceptions or slowdowns with denormals / infinities. The x87 FPU control word is separate from the SSE rounding / exception control register (`MXCSR`). Different handling of denormals or `NaN`s could I think explain the factor of 26 perf diff. C# may set denormals-are-zero in the MXCSR. – Peter Cordes Aug 07 '15 at 13:03
  • denormalized floats are a performance nightmare - they come from nowhere as intermediate results in certain constellations and performance drops by a factor of 20 without any code change :-X – Falco Aug 07 '15 at 13:17
  • 2
    @Trauer and usr: I just noticed that the `valueList[i] = i`, starting from `i=0`, so the first loop iteration does `0.0 / 0.0`. So every operation in your entire benchmark is done with `NaN`s. That division is looking less and less innocent! I'm not an expert on performance with `NaN`s, or the difference between x87 and SSE for this, but I think this explains the 26x perf difference. I bet your results will be a *lot* closer between 32 and 64bit if you initialize `valueList[i] = i+1`. – Peter Cordes Aug 07 '15 at 13:30
  • @Falco: It's too bad C botched its treatment of the extended-precision type, since one of the advantages of that type is that in code which uses the type properly, the performance costs of loading or storing a denormalized `double` can often be greatly reduced. Further, the semantic problems of having a properly-used extended-precision type flush to zero are much less than those of having `double` flush to zero. Extended-precision types aren't normally used in large arrays, so adding 16 or even 48 bits of padding on each value wouldn't harm performance. – supercat Aug 07 '15 at 20:46
  • @supercat: I'm not sure I follow you. You wish C had pushed HW designers to add nicer extended-precision float with flush-to-zero? x87 is a special case; most machines need tricks like double-double (https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic) for extended precision. Just FYI, `long double` loads/stores get so little use that on current Intel designs they're ~4x slower than 32/64 bit x87 loads/stores with on-the-fly conversion. If it was used, Intel could make it fast, though. – Peter Cordes Aug 08 '15 at 02:20
  • @PeterCordes: Most processors without FPUs can operate significantly more quickly on a type with a 64-bit mantissa with an "explicit" leading 1 than with a 52-bit mantissa and an implied leading 1. Intel may have been the only hardware vendor to take a significant interest in the format, but it was also supported by the Standard Apple Numerical Environment in the 1980s Macintosh (I believe since its January 1984 debut). – supercat Aug 08 '15 at 17:56
  • @PeterCordes: C was designed with the intention that all operations on floating-point math will promote them to the longest available format, and floating-point arguments to variadic functions will promote them to that format. If ANSI C had provided a prototype syntax (e.g. `printf(const char *fnt, ... (double));` to indicate that all floating-point types should be coerced to `double` [or `...(long double)` for that type] then the language could have specified that operations on `double` promote to `long double` (which may or may not be larger than `double`)... – supercat Aug 08 '15 at 18:03
  • ...and operations on `float` promote to `long float` (which could be equivalent to `float`, `long double`, or anything in-between). Systems where rounding values to 53-bit precision after every operation would be slower than keeping them as 64-bit precision could define `long double` as an 80-bit type; those where using 53-bit precision throughout would be fastest could define `long double` as that type. The default type for unprefixed floating-point constants would be `long double`. As far as I'm concerned, those semantics would have been nicer than what's happened instead. – supercat Aug 08 '15 at 18:08
  • 1
    As for flush-to-zero, I'm not too keen on it with 64-bit double, but when 80bit extended and 64-bit double are used together, situations where an 80-bit value could underflow and then get scaled up enough to yield a value which would be representable as 64-bit `double` would be pretty rare. One of the main usage patterns fro 80-bit type was to allow multiple numbers to be summed together without having to tightly round the results until the very end. Under that pattern, overflows just aren't a problem. – supercat Aug 08 '15 at 18:15
  • @supercat: Ok, I think I see what you're saying. Most compilers in-practice do use long-double temporaries with x87 floating point, even though that's not what the standard says. gcc has a `-ffloat-store` option to store-and-reload floats between every expression, to round them off to the precision C says they're supposed to have. I think that option is off even without `-ffast-math`, because it's so expensive. But any time a float temporary is spilled to the stack, it does get rounded. So I agree, that would have been nice. – Peter Cordes Aug 09 '15 at 03:02
  • IDK if common usage of 80bit temporaries would have led Intel to design SSE any differently, though. Two 64bit doubles fit in a 128bit register. You'd only get 64bit temps for 32b float, at best. Making room for extended precision would be costly. Probably even scalar-only extended precision (80 bits in a 128b register) would have been extra complexity for piping data around, and made it less convenient to pull scalar FP data into vectorized computations. Until Core2, Intel CPUs handled 128b `addps` and so on as two 64bit halves. – Peter Cordes Aug 09 '15 at 03:12
  • @PeterCordes: Most vectorized floating-point computations, from what I understand, entail fetching small groups of values, operating on them, and then storing a rounded result in memory, and/or adding the result to a running total in a register. I don't think there should have been any particular problem having e.g. 192-bit registers each of which can hold two 80-bit values, four 48-bit values, or one 80-bit and two 48-bit values, along with instructions to load/store the register or portions thereof in packed or unpacked format. If software performs e.g. – supercat Aug 09 '15 at 20:27
  • ...load r0 with four floats from p0, load r1 with four floats from p1, parallel-add r0+=r1, load r2 with four floats from p2, parallel-add r0+=r2, and store r0 to p3, such code would require three 128-bit reads and one 128-bit store; keeping the intermediate computation in extended precision wouldn't require any extra memory bandwidth. While it's possible to write calculations so as to avoid having results get trashed by catastrophic calculations, in many cases having more precise intermediate types can reduce the number of required calculations. – supercat Aug 09 '15 at 20:36
  • @supercat. Yes, that would work, and I can see how it would be useful. However, that would give FP vectors wider elements than int vectors, so the execution units for shuffles / blends couldn't be shared with integer shuffles (as they are now). There would be a real cost to this, in transistors. And if the registers are internally 192bit, people would (reasonably or not) demand to be able to use that full width for integer data, etc. etc. 24B vectors don't evenly go into 64B cache lines, so loads that cross cache lines would be common. Those used to be very expensive, and are still a bit – Peter Cordes Aug 10 '15 at 00:17
  • @PeterCordes: As I would design it, all loads or stores would operate on power-of-two numbers of bits; storing a 48-bit register would either pack the result down to 32 bits or add 16 bits of padding (unpacked values wouldn't generally be stored in arrays, so the padding would be no problem). No need for unaligned loads/stores. Further, if a 32-bit mantissa would suffice for many needs that presently require "double" (I expect it would) hardware which could perform four 25x32-bit multiplies per cycle (for the case where one of the operands was freshly-loaded)... – supercat Aug 10 '15 at 15:38
  • ...would be much cheaper than hardware to perform two 53x53-bit multiplies per cycle, while offering more practical real-world performance (having the hardware take two cycles if neither operand was freshly loaded would probably not hurt performance too much, but if one wanted unconditional single-cycle operation, one could expand to four 32x32 multiplies and still be much cheaper than two 53x53. – supercat Aug 10 '15 at 15:40

valueList[i] = i, starting from i=0, so the first loop iteration does 0.0 / 0.0. So every operation in your entire benchmark is done with NaNs.
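
A quick C# illustration, mirroring iteration zero of the benchmark:

double result = 0;
double temp = 0;        // valueList[0] == 0.0
result *= temp;         // 0.0
result += temp;         // 0.0
result /= temp;         // 0.0 / 0.0 => NaN
result -= temp;         // still NaN; every later iteration stays NaN
Console.WriteLine(double.IsNaN(result));    // True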

As @usr showed in disassembly output, the 32bit version used x87 floating point, while 64bit used SSE floating point.

I'm not an expert on performance with NaNs, or the difference between x87 and SSE for this, but I think this explains the 26x perf difference. I bet your results will be a lot closer between 32 and 64bit if you initialize valueList[i] = i+1. (update: usr confirmed that this made 32 and 64bit performance fairly close.)

Division is very slow compared to other operations. See my comments on @usr's answer. Also see http://agner.org/optimize/ for tons of great stuff about hardware, and optimizing asm and C/C++, some of it relevant to C#. He has instruction tables of latency and throughput for most instructions for all recent x86 CPUs.

However, 10B x87 fdiv isn't much slower than SSE2's 8B double precision divsd, for normal values. IDK about perf differences with NaNs, infinities, or denormals.

They have different controls for what happens with NaNs and other FPU exceptions, though. The x87 FPU control word is separate from the SSE rounding / exception control register (MXCSR). If x87 is getting a CPU exception for every division, but SSE isn't, that easily explains the factor of 26. Or maybe there's just a performance difference that big when handling NaNs. The hardware is not optimized for churning through NaN after NaN.

IDK if the SSE controls for avoiding slowdowns with denormals will come into play here, since I believe result will be NaN all the time. IDK if C# sets the denormals-are-zero flag in the MXCSR, or the flush-to-zero-flag (which writes zeroes in the first place, instead of treating denormals as zero when read back).

I found an Intel article about SSE floating point controls, contrasting it with the x87 FPU control word. It doesn't have much to say about NaN, though. It ends with this:

Conclusion

To avoid serialization and performance issues due to denormals and underflow numbers, use the SSE and SSE2 instructions to set Flush-to-Zero and Denormals-Are-Zero modes within the hardware to enable highest performance for floating-point applications.

IDK if this helps any with divide-by-zero.

for vs. foreach

It might be interesting to test a loop body that is throughput-limited, rather than just being one single loop-carried dependency chain. As it is, all of the work depends on previous results; there's nothing for the CPU to do in parallel (other than bounds-check the next array load while the mul/div chain is running).

You might see more difference between the methods if the "real work" occupied more of the CPU's execution resources. Also, on pre-Sandybridge Intel, there's a big difference between a loop fitting in the 28-uop loop buffer or not. You get instruction decode bottlenecks if not, esp. when the average instruction length is longer (which happens with SSE). Instructions that decode to more than one uop will also limit decoder throughput, unless they come in a pattern that's nice for the decoders (e.g. 2-1-1). So a loop with more instructions of loop overhead can make the difference between a loop fitting in the 28-uop loop buffer or not, which is a big deal on Nehalem, and sometimes helpful on Sandybridge and later.
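
As a rough sketch of a throughput-limited body, assuming the question's valueList and dummy: several independent accumulators give the out-of-order core parallel work (note this changes rounding, since FP math isn't associative):

double r0 = 0, r1 = 0, r2 = 0, r3 = 0;
int i = 0;
for (; i + 3 < valueList.Count; i += 4)
{
    // four independent dependency chains instead of one
    r0 += valueList[i];
    r1 += valueList[i + 1];
    r2 += valueList[i + 2];
    r3 += valueList[i + 3];
}
for (; i < valueList.Count; i++)    // leftover elements
    r0 += valueList[i];
dummy = r0 + r1 + r2 + r3;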

Peter Cordes
  • I've never had a case where I observed any performance difference based upon whether NaNs were in my data stream, but the presence of denormalized numbers can make a *huge* difference in performance. It doesn't appear to be the case in this example, but it's something to keep in mind. – Jason R Aug 07 '15 at 13:56
  • @JasonR: Is that just because `NaN`s are really rare in practice? I left in all the stuff about denormals, and the link to Intel's stuff, mostly for the benefit of readers, not because I thought it would really have much effect on this specific case. – Peter Cordes Aug 07 '15 at 14:00
  • In most applications they are rare. However, when developing new software that uses floating point, it's not rare at all for implementation bugs to yield streams of NaNs instead of the desired results! This has occurred to me many times and I don't recall any noticeable performance hit when NaNs pop up. I have observed the opposite if I do something that causes denormals to appear; that typically results in an immediately-noticeable drop in performance. Note that these are just based on my anecdotal experience; there may be some performance drop with NaNs that I just haven't noticed. – Jason R Aug 07 '15 at 15:26
  • @JasonR: IDK, maybe NaNs aren't much if any slower with SSE. Clearly they're a big problem for x87. SSE FP semantics were designed by Intel in the PII / PIII days. Those CPUs have the same out-of-order machinery under the hood as current designs, so presumably they had high performance for P6 in mind when designing SSE. (Yes, Skylake is based on the P6 microarchitecture. Some things have changed, but it still decodes to uops, and schedules them to execution ports with a re-order buffer.) x87 semantics were designed for an optional external co-processor chip for an in-order scalar CPU. – Peter Cordes Aug 07 '15 at 15:40
  • @PeterCordes Calling Skylake a P6-based chip is too far a stretch. 1) The FPU was (almost) totally redesigned during the Sandy Bridge era, so the old P6 FPU is basically gone as far as today; 2) the x86 to uop decode had a critical modification during the Core2 era: while previous designs decode compute and memory instructions as separate uops, the Core2+ chips have uops consisting of a compute instruction **and** a memory operator. This led to greatly increased performance and power efficiency, at the cost of a more complex design and potentially lower peak frequency. – shodanshok Aug 07 '15 at 22:02
  • @shodanshok: Oh yeah, there have been huge changes from Pentium Pro to Skylake, as one part or another of the design got major improvements. However, it's not just laziness that has led Intel to keep `fam=6` as the reported CPUID. Unlike netburst, it got to this point incrementally, and still has similarities. In http://www.realworldtech.com/sandy-bridge/5/, David Kanter refers to it as P6-based. (I'd say Sandybridge's switch to a physical register file, and also adding a uop cache, was one of the biggest changes in P6's evolution.) – Peter Cordes Aug 08 '15 at 01:53
  • @shodanshok: micro-fusion is great, and I'm curious to see if Skylake re-enables it for 2-register addressing modes, since AVX512 in Xeon SKL should require more input deps per uop. (SnB/HSW can only micro-fuse one-register modes: http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes). The out-of-order mechanism is still basically the same. i.e. it's not like Transmeta's binary-translation, or a kilo-instruction processor with occasional state checkpointing and out-of-order *retirement* to hide giant memory latencies (http://m3.csl.cornell.edu/papers/taco04.pdf). – Peter Cordes Aug 08 '15 at 02:00
  • @PeterCordes True, but a line must be drawn at the end. Generally, the industry identifies that line in the Nehalem / Sandy Bridge transition. From the very same DK article you linked: _The Sandy Bridge CPU cores can truly be described as a brand new microarchitecture that is a synthesis of the P6 and some elements of the P4. Although Sandy Bridge most strongly resembles the P6 line, it is an utterly different microarchitecture_ In short: it is the spiritual successor of the P6, with the same background philosophy, but with a vastly different implementation. – shodanshok Aug 08 '15 at 07:50
  • @PeterCordes Another interesting quote: _Sandy Bridge uses fundamentally different, and more efficient, techniques for tracking and renaming the uops in flight, as compared to the approach used in P6 derivatives such as Nehalem_ Here you can see how DK consider Nehalem to be P6-based, but not the then-new Sandy Bridge. – shodanshok Aug 08 '15 at 07:53
  • @shodanshok: heh, I missed that part when skimming. Apparently I was mistaken, and SnB was the point at which P6 evolved into a new species. I noticed the quote I found really was only saying Nehalem was P6. Thanks for finding the others. The CPUID is still `fam=6`, but apparently that's just inertia. Changing it would maybe be annoying for OS writers. The main visible effect of the major redesign is that register-read stalls don't exist anymore. (IDK if the uop cache and two read ports would have been possible without the change to a PRF. Those are huge for perf.) – Peter Cordes Aug 08 '15 at 08:03

We have the observation that 99.9% of all the floating-point operations will involve NaNs, which is at least highly unusual (first noticed by Peter Cordes). We have another experiment by usr, which found that removing the division instructions makes the time difference almost completely go away.

However, the NaNs are only generated because the very first division calculates 0.0 / 0.0, which gives the initial NaN. If the divisions are not performed, result will always be 0.0, and we will always calculate 0.0 * temp -> 0.0, 0.0 + temp -> temp, temp - temp = 0.0. So removing the division not only removed the divisions but also removed the NaNs. I would expect that the NaNs are actually the problem, and that one implementation handles NaNs very slowly, while the other one doesn't have the problem.

It would be worthwhile starting the loop at i = 1 and measuring again. The four operations result * temp, + temp, / temp, - temp effectively add (1 - temp) so we wouldn't have any unusual numbers (0, infinity, NaN) for most of the operations.

The only problem could be that the division always gives an integer result, and some division implementations have shortcuts when the correct result doesn't use many bits. For example, dividing 310.0 / 31.0 gives 10.0 as the first four bits with a remainder of 0.0, and some implementations can stop evaluating the remaining 50 or so bits while others can't. If there is a significant difference, then starting the loop with result = 1.0 / 3.0 would make a difference.
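
A minimal sketch of the suggested re-run, combining both ideas (start at i = 1, seed result with 1.0 / 3.0), assuming the question's valueList and dummy:

double result = 1.0 / 3.0;          // non-trivial seed: division results
                                    // are no longer exact small integers
for (int i = 1; i < valueList.Count; i++)   // skip index 0, so temp != 0.0
{
    var temp = valueList[i];
    result *= temp;
    result += temp;
    result /= temp;
    result -= temp;
}
dummy = result;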

gnasher729

There may be several reasons why this is executing faster in 64bit on your machine. The reason I asked which CPU you were using was because when 64bit CPUs first made their appearance, AMD and Intel had different mechanisms to handle 64bit code.

Processor architecture:

Intel's CPU architecture was purely 64bit. In order to execute 32bit code, the 32bit instructions needed to be converted (inside the CPU) to 64bit instructions before execution.

AMD's CPU architecture was to build 64bit right on top of their 32bit architecture; that is, it was essentially a 32bit architecture with 64bit extensions - there was no code conversion process.

This was obviously a few years ago now, so I've no idea if/how the technology has changed, but essentially, you would expect 64bit code to perform better on a 64bit machine, since the CPU is able to work with twice as many bits per instruction.

.NET JIT

It's argued that .NET (and other managed languages like Java) is capable of outperforming languages like C++ because of the way the JIT compiler is able to optimize your code according to your processor architecture. In this respect, you might find that the JIT compiler is utilizing something in the 64bit architecture that possibly wasn't available or required a workaround when executed in 32bit.

Note:

Rather than using DoubleWrapper, have you considered using Nullable<double> (shorthand syntax: double?)? I'd be interested to see if that has any impact on your tests.
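
A minimal sketch of that change against the question's code (note that Nullable<double> is itself a struct, so it wraps the value without making it a true reference type):

var nullableList = new List<double?>(size);
for (int i = 0; i < size; i++)
    nullableList.Add(i);

Benchmarker.Profile("nullableList for: ", iterationCount, () =>
{
    double result = 0;
    for (int i = 0; i < nullableList.Count; i++)
    {
        var temp = nullableList[i].Value;
        result *= temp;
        result += temp;
        result /= temp;
        result -= temp;
    }
    dummy = result;
});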

Note 2: Some people seem to be conflating my comments about 64bit architecture with IA-64. Just to clarify, in my answer, 64bit refers to x86-64 and 32bit refers to x86-32. Nothing here referenced IA-64!

Matthew Layton
  • I'll do that. Give me a second. – Trauer Aug 07 '15 at 10:49
  • 4
    OK, so why is it 26x faster? Cannot find this in the answer. – usr Aug 07 '15 at 10:54
  • 2
    I'm guessing it's the jitter differences, but no more than guessing. – Jon Hanna Aug 07 '15 at 10:55
  • The Itanium (IA-64) CPU in fact does not do translations, while the x86 and x64 architectures do. And it didn't have 64 bit instructions. They were 41 bits. (Yes, that's weird). IA-64 had 64 bit pointers, but that's something else. x86 instructions are also weird, they're variable-length (neither 32 bits nor 64). – MSalters Aug 07 '15 at 11:56
  • @MSalters: IIRC, later IA-64 designs removed ability to execute x86 instructions from the hardware. Running IA-32 instructions on a late-generation Itanium required an interpreter or JIT binary-translation layer. – Peter Cordes Aug 07 '15 at 16:57
  • 2
    @seriesOne: I think MSalters is trying to say you're mixing up IA-64 with x86-64. (Intel also uses IA-32e for x86-64, in their manuals). Everyone's desktop CPUs are x86-64. The Itanic sank a few years ago, and I think was mostly used in servers, not workstations. Core2 (the first P6 family CPU supporting x86-64 long mode) actually has some limitations in 64bit mode. e.g. uop macro-fusion only works in 32bit mode. Intel and AMD did the same thing: extended their 32bit designs to 64bit. – Peter Cordes Aug 07 '15 at 17:05
  • 1
    @PeterCordes where did I mention IA-64? I am aware that Itanium CPUs were an entirely different design and instruction set; early models tagged as EPIC or Explicitly Parallel Instruction Computing. I think MSalters is conflating 64bit and IA-64. My answer holds true for x86-64 architecture- there was nothing in there referencing the Itanium CPU family – Matthew Layton Aug 07 '15 at 17:09
  • 2
    @series0ne: Ok, then your paragraph about Intel CPUs being "purely 64bit" is complete nonsense. I assumed you were thinking of IA-64 because then you wouldn't have been completely wrong. There was never an extra translation step for running 32bit code. The x86->uop decoders just have two similar modes: x86 and x86-64. Intel built 64bit P4 on top of P4. 64bit Core2 came with many other architectural improvements over Core and Pentium M, but things like macro-fusion only working in 32bit mode show that 64bit was bolted on. (fairly early in the design process, but still.) – Peter Cordes Aug 07 '15 at 17:16
  • @series0ne: see Agner Fog's microarch docs for details of how CPUs operate, from Pentium to Haswell. (And AMD CPUs, too.) http://agner.org/optimize/. Also good writeups for most microarchitectures at http://realworldtech.com/ – Peter Cordes Aug 07 '15 at 17:18
  • @PeterCordes I will try and find a reference to what I'm talking about – Matthew Layton Aug 07 '15 at 17:18
  • @series0ne: When you said "pure 64 bits", I indeed assumed Itanium because that's Intel's only 64 bit architecture. x86-64 is an AMD design, and was always a hybrid with even 8086 support. No idea what design you're thinking of, though. – MSalters Aug 08 '15 at 20:44