
I wrote a small benchmark in which the program creates 10^8 two-dimensional vector objects of {float, float} and then sums up the squares of their lengths.

Here is the C++ code:

#include <iostream>
#include <chrono>
#include <vector>
#include <array>
#include <cmath>
    
using namespace std;
using namespace std::chrono;
    
const int COUNT = pow(10, 8);
    
class Vec {
public:
    float x, y;
    
    Vec() {}
    
    Vec(float x, float y) : x(x), y(y) {}
    
    float len() {
        return x * x + y * y;
    }
};
    
int main() {
    vector <Vec> vecs;
    
    for(int i = 0; i < COUNT; ++i) {
        vecs.emplace_back(i / 3, i / 5);
    }
    
    auto start = high_resolution_clock::now();
    
    // This loop is timed
    float sum = 0;
    for(int i = 0; i < COUNT; ++i) {
        sum += vecs[i].len();
    }
    
    auto stop = high_resolution_clock::now();
    
    cout << "finished in " << duration_cast <milliseconds> (stop - start).count()
         << " milliseconds" << endl;
    cout << "result: " << sum << endl;
    
    return 0;
}

For which I used this makefile (g++ version 7.5.0):

build:
	g++ -std=c++17 -O3 main.cpp -o program #-ffast-math
    
run: build
	clear
	./program

Here is my Java code:

public class MainClass {
    static final int COUNT = (int) Math.pow(10, 8);

    static class Vec {
        float x, y;

        Vec(float x, float y) {
            this.x = x;
            this.y = y;
        }

        float len() {
            return x * x + y * y;
        }
    }

    public static void main(String[] args) throws InterruptedException {

        Vec[] vecs = new Vec[COUNT];

        for (int i = 0; i < COUNT; ++i) {
            vecs[i] = new Vec(i / 3, i / 5);
        }

        long start = System.nanoTime();

        // This loop is timed
        float sum = 0;
        for (int i = 0; i < COUNT; ++i) {
            sum += vecs[i].len();
        }

        long duration = System.nanoTime() - start;
        System.out.println("finished in " + duration / 1000000 + " milliseconds");
        System.out.println("result: " + sum);
    }
}

Which was compiled and run with Java 11.0.4.

And here are the results (the average of a few runs, on Ubuntu 18.04 64-bit):

c++:  262 ms
java: 230 ms

Here are a few things I have tried in order to make the c++ code faster:

  • Use std::array instead of std::vector
  • Use a plain array instead of std::vector
  • Use an iterator in the for loop

However, none of the above resulted in any improvement.

I have noticed a few interesting things:

  1. When I time the whole main() function (allocation + computation), C++ is much better. However, this might be due to the warm up time of the JVM.
  2. For a lower number of objects, like 10^7, C++ was slightly faster (by a few milliseconds).
  3. Turning on -ffast-math makes the C++ program a few times faster than Java, however the result of the computation is slightly different. Furthermore, I read in a few posts that using this flag is unsafe.

Can I somehow modify my C++ code and make it as fast or faster than Java in this case?

asked by Roy Varon; edited by Remy Lebeau
  • I don't think such a 15% difference is significant. To answer to such a question one needs to go to assembly code which is even harder with the JVM... So without having all the details (compiler version, OS, ...) I don't think anyone can answer. – hivert Feb 20 '21 at 15:46
  • Possibly due to one binary running 32 bit code, and the other running 64 bit? – 1201ProgramAlarm Feb 20 '21 at 15:48
  • Thank you for your comments. Both programs indeed run in 64bit mode. Using reserve, did not increase the performance. COUNT is identical in both programs. – Roy Varon Feb 20 '21 at 15:55
  • Yes, as written in the Makefile above, I indeed used ```-O3``` – Roy Varon Feb 20 '21 at 15:57
  • @RoyVaron Using reserve would probably (definitely) improve the performance if the benchmark was of the whole program. I deleted my comment after realizing that the filling-up of the `vector` isn't actually in the timed section... – Telescope Feb 20 '21 at 15:57
  • `10^8` (`pow(10, 8)`) overflows a 32 bit `int` so `COUNT` on the 2 systems could be different. Output the value of `COUNT` to check please. Also in C++ it is UB so all bets are off. – Richard Critten Feb 20 '21 at 16:00
  • Java code can in fact be faster than the equivalent C++ code, but only if the code is JITed properly, and even then Java is rarely faster. If you want to properly benchmark Java code, you should use something like [JMH](https://github.com/openjdk/jmh), as correct Java microbenchmarks are hard. – dan1st Feb 20 '21 at 16:00
  • @AlessandroTeruzzi Yes, but I am not timing the creating of the array or the vector. I am only timing the processing of the data, please see where I put the ```start``` variable. – Roy Varon Feb 20 '21 at 16:03
  • I wonder how much of this Java figured out as constants at "compile" time... and whether C++ figured out how to unroll that loop into an iteration over the vector. Try changing that for loop notation and making the vector a C-style array. – Abel Feb 20 '21 at 16:03
  • @RichardCritten I checked and in both the java and c++ programs COUNT equals to ```100000000``` (I printed it). – Roy Varon Feb 20 '21 at 16:04
  • @Abel that's a good point. I did try to unroll the loops myself; however, when doing the same unrolling for the Java and the C++ program, the Java program was still faster, though the difference was significantly reduced. Also, changing it to a C-style array did not improve performance. – Roy Varon Feb 20 '21 at 16:06
  • If you increase the number of objects even more, does the difference increase? – dan1st Feb 20 '21 at 16:08
  • @dan1st yes, after I increased the number of objects from 10^8 to 11^8 the difference grew from 30ms to 50ms. – Roy Varon Feb 20 '21 at 16:10
  • Wouldn't the `len()` function prevent fused multiply-add instructions from being emitted, which would be hindering optimizations? The result of `z += (x * x + y * y)` will be different from `z += (x * x); z += (y * y)`, so the C++ compiler should not be able to perform this optimization without `-ffast-math` because it loses potential accuracy. I don't know much about what Java requires, but it's possible that Java doesn't force the accuracy the same way and is already doing a fused multiply/add instruction due to the bytecode sequence – Human-Compiler Feb 20 '21 at 16:18
  • Does this work any better? https://gcc.godbolt.org/z/cG7s3e – Aykhan Hagverdili Feb 20 '21 at 16:18
  • @AyxanHaqverdili unfortunately I cannot test your code since it's written in C++20, and my compiler does not support it yet (that's why I specified -std=c++17 in my makefile). – Roy Varon Feb 20 '21 at 16:24
  • I think that the JIT just performed better optimizations than gcc, as it could e.g. use heuristics and runtime analysis. – dan1st Feb 20 '21 at 16:24
  • @RoyVaron try this https://gcc.godbolt.org/z/Gx35rh – Aykhan Hagverdili Feb 20 '21 at 16:29
  • @AyxanHaqverdili Oh wow, it significantly improved the performance. The C++ code now runs 100 ms faster than Java. However, the result is for some reason different: it changed from 3.7778932E22 (my code, Java and C++) to 5.03704e+22 (your code). – Roy Varon Feb 20 '21 at 16:39
  • c style array was more for a for loop notation to be unrolled without changing notation much. JVM has some neat runtime ops for loops. if you got them both to unroll, the only thing after that I can think of would be to force both to be locked to a single core for deterministic tests. – Abel Feb 20 '21 at 16:39
  • @RoyVaron does this give the same result? https://gcc.godbolt.org/z/b3EEaq – Aykhan Hagverdili Feb 20 '21 at 16:44
  • @AyxanHaqverdili that indeed fixed it. Now the C++ code is both faster and gives the same result! Is the secret in the ```std::accumulate``` function? – Roy Varon Feb 20 '21 at 16:48
  • @AyxanHaqverdili I have played a bit with your code. I gradually copied one change after another from your code to mine to see what part exactly caused the biggest performance improvement. The biggest improvement by far was gained by making the container static. It didn't matter much if it was a C-style array or an ```std::vector```. Simply making it static shaved 120 ms off the run time! But why is that? – Roy Varon Feb 20 '21 at 17:04
  • @RoyVaron I am guessing it has something to do with data locality. I am not sure why it makes a difference in the case of `std::vector`. See the generated assembly in all cases. – Aykhan Hagverdili Feb 20 '21 at 17:08
  • @AyxanHaqverdili thank you a lot for your insight. If you submit an answer I will gladly accept it. – Roy Varon Feb 20 '21 at 17:42
  • @Roy I like the current answer better. It's much more elegant. – Aykhan Hagverdili Feb 20 '21 at 19:55
  • @royvaron if you are still around, I'm curious what the timing of the 10^8 case is in C++ using the transform reduce algorithm is compared to Java. – Yakk - Adam Nevraumont Feb 23 '21 at 04:16
  • @Yakk-AdamNevraumont unfortunately I couldn't use the transform-reduce code because it's a C++20 feature and my compiler supports only up to C++17. – Roy Varon Mar 17 '21 at 22:13

1 Answer


Try this:

    // needs <numeric> for transform_reduce, <execution> for the policy,
    // and <functional> for std::plus
    float sum = std::transform_reduce(
        std::execution::par_unseq,
        begin(vecs), end(vecs),
        0.f,
        std::plus<>{},
        [](auto&& x){
            return x.len();
        }
    );

This explicitly tells the C++ compiler what you are doing: that you are OK with using extra threads, that the loop iterations don't depend on one another, and that you want to do the work in floats.

It does mean that the additions may occur in a different order than the one you asked for, so the output value may not be exactly the same.

Live example with the raw loop on one side, and permission to out-of-order add on the other.


Further investigation:

So I spun up a godbolt.

In it, I compared this with and without forced vectorization and -ffast-math. Forcing vectorization and -ffast-math resulted in the same assembly code.

The issue is the accumulator. Adding things one at a time into the sum and doing all of the IEEE rounding gives you a different value than accumulating them N at a time in higher precision floating point values, then storing the result en-bulk back into the float.

If you use -ffast-math you'll get 2x speed and a different accumulation. If you replace the float sum with double sum, you'll get the same answer as with -ffast-math and vectorization on.

Basically, the clang vectorizer doesn't see an easy way to vectorize the accumulation of the sums without breaking the exact float-precision floating point requirements.

answered by Yakk - Adam Nevraumont
  • I forgot `transform_reduce` even exists! How much difference does `__attribute__((always_inline))` make? I would expect the compiler to inline that 1-line function regardless. – Aykhan Hagverdili Feb 20 '21 at 17:16
  • Does the fact that you removed the constructors matter? It makes the class trivially-constructible, so that must matter a bit. – Aykhan Hagverdili Feb 20 '21 at 17:24
  • @AyxanHaqverdili I was just iteratively removing "cruft" that could get in the way of the compiler knowing that `Vec` was just some flat binary data (trivially copyable etc). I doubt it matters. Note that I'm simulating with only 10^7 elements, as godbolt complains if you ask for 10^8. ;) – Yakk - Adam Nevraumont Feb 20 '21 at 17:26
  • `#pragma omp simd reduction(+:sum)` compiled with `-fopenmp` will give the compiler permission to pretend that FP math is associative for that reduction without having to use `-ffast-math` for the whole file, if you don't want to. – Peter Cordes Feb 20 '21 at 23:17
  • Also note that using multiple accumulators (elements of a SIMD vector, and multiple SIMD vectors) is generally a *good* thing for numerical accuracy. The result is different because it probably has *less* rounding error than the naive scalar version, closer to pairwise summation. ([Simd matmul program gives different numerical results](https://stackoverflow.com/q/55477701) / [Efficient stable sum of a sorted array in AVX2](https://stackoverflow.com/q/46119811)). It's not true in general that `-ffast-math` will make your results more accurate, though! – Peter Cordes Feb 20 '21 at 23:24