
I did a simple performance comparison focused on floating-point operations using C#, targeting a Raspberry Pi 3 Model B with Windows 10 IoT, and compared it with an Intel Core i7-6500U CPU @ 2.50GHz.

Raspberry Pi 3 Model B V1.2 - Test Results - Chart

Intel Core i7-6500U CPU @ 2.50GHz - x64 Test Results - Chart

According to those tests, the Intel Core i7 (x64) is only twelve times faster than the Raspberry Pi 3!

The factor is 11.67 to be exact, calculated from the best performance achieved in those tests on each platform. Both platforms achieved their best performance with four threads running in parallel, each doing very simple, independent calculations.

Question: what is the correct method to measure and compare computational performance of such platforms? The intention is to compare computational performance in the area of optimization algorithms, machine learning algorithms, statistical analysis, etc. Thus my focus was on floating-point operations.

There are some benchmarks (like MWIPS) and measures like MIPS or FLOPS, but I didn't find a single way to compare different CPU platforms in terms of computational power.

I found one comparison by Roy Longbottom (Google "Roy Longbottom's Raspberry Pi, Pi 2 and Pi 3 Benchmarks"; I cannot post more links here), but according to his benchmark Raspberry Pi 3 is only four times faster than Intel Core i7 (x64 architecture, MFLOPS comparison). That is very different from my results.

Here are details of the tests that I performed:

The test was built around a simple operation that is executed iteratively:

    private static float SingleAverageCalc(float seed, long nTimes)
    {
        float x1 = seed, x2 = 0;
        long n = 0;

        for (; n < nTimes; ++n)
        {
            x2 = x2 + x1 * n;
        }

        return x2 / n;
    }

where seed is generated randomly in the calling function and nTimes is the number of iterations. The intention is to avoid simple compile-time optimisations.

This test function was called several times with various iteration counts (1M, 10M, 100M and 1B), both in a single thread and in multiple threads. The multithreaded test looks like this:

    private static async void RunTestMT(string name, long n, int tn, Func<float, long, float> f)
    {
        float seed = (float)new Random().NextDouble();
        DateTime s1 = DateTime.Now;
        List<IAsyncAction> threads = new List<IAsyncAction>();
        for (int i = 0; i < tn; i++)
        {
            threads.Add( ThreadPool.RunAsync((operation) => { f(seed, n/tn); }, WorkItemPriority.High));
        }
        for (int i = 0; i < tn; i++)
        {
            threads[i].AsTask().Wait();
        }
        TimeSpan dt = DateTime.Now - s1;

        Debug.WriteLine(String.Format("{0} ({1:N0}; {3}T): {2:mm\\:ss\\.fff}", name, n, dt, tn));
    }

Tests were run in Debug mode. The application was built as UWP (Universal Windows Platform), for the ARM architecture on the Raspberry Pi and for x86 on Intel.

Pawel
  • Even if your benchmark was telling (it isn't), why would you consider an order of magnitude processing power difference to be insignificant? – Luaan Nov 21 '16 at 12:49
  • Why are you talking about "Core i7" like it was a single microarchitecture? There's a *significant* difference between Nehalem (the first i7) and Skylake (your i7). e.g. see [Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs](http://stackoverflow.com/questions/37361145/deoptimizing-a-program-for-the-pipeline-in-intel-sandybridge-family-cpus/37362225#37362225) where I explained this. – Peter Cordes Nov 21 '16 at 13:46
  • Thanks for comments, guys. That was my question - if there is a decent yet simple way to compare different CPU platforms for performance in mentioned regard. Simple enough, so very different platforms could be at least compared to the magnitude in some optimal, long-running scenario. Simplicity was a key in that question. Sorry, if I didn't state that clearly. – Pawel Nov 21 '16 at 15:08
  • I think you meant "slower" in "but according to his benchmark Raspberry Pi 3 is only four times faster than Intel Core i7" – Nathan B Sep 10 '17 at 08:06

2 Answers

Tests were run in Debug mode.

Just noticed this last part. /facepalm. If C# debug-mode is anything like the debug mode in compilers like MSVC, gcc, and clang, then that's useless and a waste of everyone's time.

The speed ratio between debug and optimized code is not constant across different microarchitectures. It varies with many factors, including the specific code being tested. If anything, the extra stores/reloads will introduce extra latency and punish Skylake more than ARM, since Skylake is capable of achieving higher instructions-per-clock when latency bottlenecks like that aren't slowing it down.
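For a retest, a minimal timing sketch might look like the following. It assumes a Release build run without a debugger attached, and it uses System.Diagnostics.Stopwatch instead of DateTime.Now. The class TimingSketch and the helper TimeSeconds are made-up names for illustration, not part of the question's code.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Same loop as SingleAverageCalc in the question.
    public static float SingleAverageCalc(float seed, long nTimes)
    {
        float x1 = seed, x2 = 0;
        long n = 0;
        for (; n < nTimes; ++n)
        {
            x2 = x2 + x1 * n;
        }
        return x2 / n;
    }

    // Hypothetical helper: times one call of f and hands the result back
    // to the caller, so the computation is not left unused.
    public static double TimeSeconds(Func<float, long, float> f, float seed, long n, out float result)
    {
        f(seed, Math.Min(n, 1000L));          // short untimed call so first-call JIT cost is excluded
        Stopwatch sw = Stopwatch.StartNew();  // high-resolution timer
        result = f(seed, n);
        sw.Stop();
        return sw.Elapsed.TotalSeconds;
    }
}
```

Calling TimingSketch.TimeSeconds(TimingSketch.SingleAverageCalc, seed, n, out result) several times and keeping the fastest run is one common way to reduce noise from other processes.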


The answer to your question if you retry in Release mode and find similar results:

If you didn't use any kind of fast-math option to let C# reorder FP operations, x2 = x2 + x1 * n; bottlenecks mostly on latency (of the FP add), not on throughput.

FP math is not associative, so rearranging it to x2 += (x1 * n + x1 * (n+1)) + (x1 * (n+2) + x1 * (n+3)) would change the result, and a compiler may not do that on its own. This kind of optimization is key: it leaves only one FP add per four terms on the loop-carried dependency chain, so the independent multiplies and adds can overlap instead of waiting on the previous iteration.
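Done by hand, the rearrangement might look like the sketch below. FourChains is a hypothetical variant of the question's loop (the name and the choice of four accumulators are assumptions for illustration), and it returns the sum rather than the average to keep the example short. Because the additions happen in a different order, its result can differ from the single-chain version in the last bits.

```csharp
using System;

public static class MultiAccumulatorSketch
{
    // Shape of the original loop: one accumulator, so every add
    // has to wait for the result of the previous add.
    public static float SingleChain(float seed, long nTimes)
    {
        float x2 = 0;
        for (long n = 0; n < nTimes; ++n)
        {
            x2 = x2 + seed * n;
        }
        return x2;
    }

    // Hypothetical variant: four independent accumulators, so four
    // dependency chains can be in flight at the same time.
    public static float FourChains(float seed, long nTimes)
    {
        float a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        long n = 0;
        for (; n + 3 < nTimes; n += 4)
        {
            a0 += seed * n;
            a1 += seed * (n + 1);
            a2 += seed * (n + 2);
            a3 += seed * (n + 3);
        }
        // Remaining iterations when nTimes is not a multiple of four.
        for (; n < nTimes; ++n)
        {
            a0 += seed * n;
        }
        return (a0 + a1) + (a2 + a3);
    }
}
```

Whether this is actually faster depends on the JIT and the CPU, so it needs to be measured in an optimized build on each platform.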

If C# has a fast-math option that allows the compiler to optimize as if FP math were associative, a smart compiler would just turn the whole loop into x2 = x1 * (nTimes * (nTimes-1) / 2), since n runs from 0 to nTimes-1.

A less-crafty compiler might just SIMD-vectorize it with a bunch of instruction-level parallelism that would allow Skylake to achieve its peak throughput of two FMAs of 256b vectors per clock. (That's 8 floats or 4 doubles per vector, and fused multiply-add is a = a + b*c.)

On Skylake, FMA has a latency of 4 cycles, so you (or the compiler) need to use 8 vector accumulators to saturate the FMA execution units. (On Haswell and Broadwell, FMA latency = 5 cycles, so you need 10 vector accumulators to keep 10 * 8 single-precision FMAs in flight to max out the throughput.)
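The accumulator counts above are latency multiplied by issue rate. As a small sketch of that arithmetic (FmaBudgetSketch and its method names are made up for illustration; the inputs are the figures quoted in this answer):

```csharp
using System;

public static class FmaBudgetSketch
{
    // Independent operations that must be in flight to hide latency:
    // latency in cycles times operations issued per cycle.
    public static int AccumulatorsNeeded(int latencyCycles, int issuePerCycle)
    {
        return latencyCycles * issuePerCycle;
    }

    // Peak floating-point operations per cycle per core, counting each
    // fused multiply-add as two operations (one multiply, one add) per lane.
    public static int PeakFlopsPerCycle(int fmaPerCycle, int lanesPerVector)
    {
        return fmaPerCycle * lanesPerVector * 2;
    }
}
```

With two FMAs per clock, AccumulatorsNeeded(4, 2) gives the 8 vectors for Skylake and AccumulatorsNeeded(5, 2) gives the 10 for Haswell and Broadwell. With 8 floats per 256b vector, PeakFlopsPerCycle(2, 8) gives 32 single-precision operations per cycle per core.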

See the x86 tag wiki for more about x86 performance details.


Of course, that will also help the ARM CPU in an RPi, since I assume it supports ARM's SIMD instruction set. NEON has 128-bit vectors. Apparently double-precision vector support is new in AArch64, but 32-bit ARM with NEON supports vector-FP with single-precision.

I don't know a lot about ARM.

Peter Cordes

Question: what is the correct method to measure and compare computational performance of such platforms?

My strategy is based on the view that no single measure can be representative. Hence, I have numerous benchmarks measuring different performance characteristics.

In the main RPi document of mine that you quote, comparisons with Core i7 are shown for the ancient standard benchmarks: Whetstone, Dhrystone, Linpack and Livermore Loops. These provide 15 different measurements, where the i7 is between 4 and 14 times faster, compared with a ratio of 3.25 for CPU MHz. Then there are a number of other benchmarks, not compared, typically with 60 measurements of different functions using data from caches or RAM.

The figure you quote (the wrong way round, saying RPi 3 is 4 times faster) is from the multithreading report using four cores. It seems that you are comparing the MFLOPS/MHz ratios of 5.025 for RPi 3 and 23 for Core i7. The measured MFLOPS are 6030 for RPi 3 and 89700 for i7, so the i7 is 14.9 times faster. These are for single-precision calculations using NEON for RPi and SSE for Intel. The MFLOPS/MHz ratio for Intel using AVX 1 instructions is also quoted, indicating i7 MFLOPS of 177840. This MP-MFLOPS test also includes further measurements with fewer calculations and with cache- or RAM-based data. The maximum speeds quoted are based on calculating 32 operations per data word in a for loop:

    x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f-(x[i]+g)*h+(x[i]+j)*k-(x[i]+l)*m+(x[i]+o)*p-(x[i]+q)*r+(x[i]+s)*t-(x[i]+u)*v+(x[i]+w)*y;
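To show where the 32 comes from and how a MFLOPS figure follows from a timing, here is a rough C# sketch of one pass of that expression. It is not the benchmark's actual source: the class Ops32Sketch, the constant array c (standing in for a, b, c, d, ... w, y in order) and the Mflops helper are assumptions for illustration.

```csharp
using System;

public static class Ops32Sketch
{
    // 11 adds inside the brackets + 11 multiplies
    // + 10 adds/subtracts joining the 11 terms = 32 operations.
    public const int OpsPerWord = 32;

    // One pass over the data; c holds the 22 constants in order.
    public static void Pass(float[] x, float[] c)
    {
        for (int i = 0; i < x.Length; i++)
        {
            float v = x[i];
            x[i] = (v + c[0]) * c[1] - (v + c[2]) * c[3] + (v + c[4]) * c[5]
                 - (v + c[6]) * c[7] + (v + c[8]) * c[9] - (v + c[10]) * c[11]
                 + (v + c[12]) * c[13] - (v + c[14]) * c[15] + (v + c[16]) * c[17]
                 - (v + c[18]) * c[19] + (v + c[20]) * c[21];
        }
    }

    // Millions of floating-point operations per second for a timed run.
    public static double Mflops(long words, long passes, double seconds)
    {
        return (double)OpsPerWord * words * passes / seconds / 1e6;
    }
}
```

For example, one pass over a million words that takes one second would be reported as Mflops(1000000, 1, 1.0), which is 32 MFLOPS.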

For best RPi MP MFLOPS performance, you should see High-Performance Linpack Benchmark results:

https://www.raspberrypi.org/forums/viewtopic.php?p=301458

Roy Longbottom