
I've got two similar implementations (Java and C) of a trivial algorithm, selection sort.

public interface SortingAlgorithm {

    public void sort(int[] a);
}

public class SelectionSort implements SortingAlgorithm {

    @Override
    public void sort(int[] a) {
        for (int i = 0; i < a.length; i++) {
            int lowerElementIndex = i;
            for (int j = i + 1; j < a.length; j++) {
                if (a[j] < a[lowerElementIndex]) {
                    lowerElementIndex = j;
                }
            }
            swap(a, lowerElementIndex, i);
        }
    }

    private void swap(int[] a, int i, int j) {
        if (i == j) {
            return;
        }
        int temp = a[i];
        a[i] = a[j];
        a[j] = temp;
    }
}

and the C one:

inline void swap(int* a, int i, int j);

void s_sort(int* a, int size) {
  int i;
  for (i = 0; i < size; i++) {
    int lowerElementIndex = i, j;
    for (j = i + 1; j < size; j++) {
      if (a[j] < a[lowerElementIndex]) {
        lowerElementIndex = j;
      }
    }
    swap(a, lowerElementIndex, i);
  }
}

inline void swap(int* a, int i, int j) {
  if (i == j) {
    return;
  }
  int temp = a[i];
  a[i] = a[j];
  a[j] = temp;
}

Now, I tried testing them on a large array (100,000 random ints). At first the results were: Java ~17 sec (compiled and executed with the Oracle JDK/JVM), C ~22 sec (compiled with GCC 4.8 without any optimization).

Of course, I then tried to optimize my C version through CFLAGS. The results were (I'm reporting CFLAGS only): -O1: ~18.4

-O2: ~18.4

-O{3-9}: ~20.9

Now, my first question is: which CFLAGS should I use to compile?

So I read the GNU manual on optimizations. Adding -march=native didn't help. After some time spent trying other options, I came across the -fprofile-arcs option. Adding it to my flags made my code finish the test in about 11 seconds! However, some files appeared in my folders: the results of the profiling. As I understand it, I should use them with -fbranch-probabilities and recompile. Recompiling results again in ~18.5 sec. And this is what I really want to ask.

How is it possible for my program to run so fast when it has to write files and collect profiling information, yet run 1.5 times slower when it doesn't?

I forgot to mention that I'm on an old PC with an Intel Celeron @ 2.8 GHz processor and Linux (Fedora 20 with Xfce) installed. If you need other information about the hardware, just ask! ;)

Edit: The code I use for the test is:

Java:

public class Test {

    public static void main(String[] args) {
        int[] a = new int[100000];
        int[] a2 = new int[100000];
        for (int i = 0; i < a.length; i++) {
            a[i] = (int)(Math.random()*100000);
            a2[i] = a[i];
        }
        SelectionSort s = new SelectionSort();
        InsertionSort s1 = new InsertionSort();
        double start = System.nanoTime();
        s.sort(a);
        double end = System.nanoTime();
        double time = (end-start)/1000000000.0; 
        System.out.println("Selection: "+time);
        start = System.nanoTime();
        s1.sort(a2);
        end = System.nanoTime();
        time = (end-start)/1000000000.0;
        System.out.println("Insertion: "+time);
    }
}

And the c:

#include "insertion_sort.h"
#include "selection_sort.h"
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main() {
  int max = 100000, i;
  srand(time(NULL));

  int array[100000], array2[100000];
  for(i=0; i<100000; i+=1) {
    array[i] = rand()%100000;
  }

  memcpy(array2, &array[0], 100000 * sizeof(int));

  clock_t inizio = clock();
  s_sort(array, max);
  clock_t fine = clock();
  float tempoEsecuzione = (float)(fine - inizio) / CLOCKS_PER_SEC;
  printf("Selection: %2.3f\n", tempoEsecuzione);

  inizio = clock();
  i_sort(array2, max);
  fine = clock();
  tempoEsecuzione = (float)(fine - inizio) / CLOCKS_PER_SEC;
  printf("Insertion: %2.3f\n", tempoEsecuzione);
  return 0;
}

The code references an insertion sort function that I haven't included in the rest of the question, because there (as expected) Java ran slower than C.

Ali
marco6
    Can you please share with us the code of the test driver (the part where you generate the random numbers, how do you get the time, etc...)? – gd1 Jan 10 '14 at 17:05
  • Depending on how your Java code is written, it is possible that the JIT compiler decides not to run your code at all if it realises that it has no side effects. Without seeing your benchmark it is difficult to say if it is realistic. – assylias Jan 10 '14 at 17:18
  • @gd1: I've added the test I'm using for both, java and c :) – marco6 Jan 10 '14 at 17:24
  • @assylias: of course it could, but if that were the case, shouldn't Java run REALLY faster? I mean, if no code is executed, then Java should exit almost instantly... – marco6 Jan 10 '14 at 17:25
  • @marco6 It won't optimise the method straight away, so your measurement could be the time it takes the JIT to get rid of unused code. Just to be sure, you should add a `println(s[123] + " " + s1[234]);` at the end to actually use the result of the sorting operation. But seeing your code, it's probably not going to make a difference. – assylias Jan 10 '14 at 17:27
  • OK Marco, I'll test the code on my machine. – gd1 Jan 10 '14 at 17:27
  • On my machine, I get 6.48 with Java and approx. 7.78 with C (-O3). I modified your test driver to use the exact same numbers in both versions (I saved the generated numbers on a text file) – gd1 Jan 10 '14 at 17:48
  • @marco6 OK, I believe I am close to the answer, please check [Why does tree vectorization make this sorting algorithm 2x slower?](http://stackoverflow.com/q/21055946/341970) – Ali Jan 10 '14 at 22:58
  • @Ali OK, so it seems to be a pipeline problem because of a hazard. Are you thinking, then, that it's just chance that the -fprofile-arcs thing makes better code? – marco6 Jan 11 '14 at 09:34
  • @marco6 The `-fprofile-arcs` is making things better because it disables tree vectorization. Other than that, it has no effect on the bottleneck. Please try with `-O2 -fno-tree-vectorize`, that seems to be the relevant flag. – Ali Jan 11 '14 at 11:12
  • @Ali yes it works as expected!! :D Thank you! If you update the answer I can accept it! – marco6 Jan 11 '14 at 11:22
  • @marco6 Glad to hear it helped! Please give me some time, I will update the answer and I will also provide explanation *why* it helps. I will let you know when I am done, please be patient. – Ali Jan 11 '14 at 11:45
  • @marco6 OK, I am done, please check the answer. Thanks for your patience. – Ali Jan 11 '14 at 17:52

4 Answers


Not really an answer, but too long for a comment.

Your Java benchmark is far from optimal - in particular, you don't allow the JVM to warm up enough. With proper warmup, on my machine, the time drops by 50% (4s vs 8s). My proposed code (with only the SelectionSort):

public static void main(String[] args) {
    SelectionSort s = new SelectionSort();

    int[] aWarmUp = new int[10];
    int[] a = new int[100000];
    for (int i = 0; i < aWarmUp.length; i++) {
        aWarmUp[i] = (int)(Math.random()*100000);
    }
    for (int i = 0; i < a.length; i++) {
        a[i] = (int)(Math.random()*100000);
    }

    measure(s, a, "Before warmup ");

    for (int i = 0; i < 10000; i++) { //warmup
        s.sort(aWarmUp);
    }


    for (int i = 1; i < 5; i++) {
        System.gc(); //gc before measurement
        //re-fill the array with random numbers
        for (int j = 0; j < a.length; j++) {
            a[j] = (int)(Math.random()*100000);
        }
        measure(s, a, "In loop ");
        System.out.println(a[123]); //use the result
    }
}

private static void measure(SelectionSort s, int[] a, String msg) {
    double start = System.nanoTime();
    s.sort(a);
    double end = System.nanoTime();
    double time = (end-start)/1000000000.0;
    System.out.println(msg + time);
}

output:

Before warmup 7.851840908
In loop 4.055204123
In loop 3.878436395
In loop 3.880136077
In loop 3.882814287

assylias

And this is what I really want to ask.

How is it possible for my program to run so fast if it has to write files and collect profiling information and instead it runs 1.5 times slower when it hasn't?

Yes, that's the real question here. Mentioning all that Java comparison stuff just adds noise.

I could reproduce the weird behavior on my machine with gcc 4.7.2. Not surprisingly, the hot path of the code is the inner for loop:

for (j = i + 1; j < size; j++) {
  if (a[j] < a[lowerElementIndex]) {
    lowerElementIndex = j;
  }
}

The only relevant difference in the corresponding generated assembly code is:

Fast case:

    cmpl    %esi, %ecx
    jge .L3
    movl    %ecx, %esi
    movslq  %edx, %rdi
.L3:

Slow case:

cmpl    %ecx, %esi
cmovl   %edx, %edi
cmovl   %esi, %ecx

The first case (fast) can greatly benefit from branch prediction, but the other (slow case) apparently cannot. Sorted or randomly shuffled arrays do not cause too many branch mispredictions, so the first code snippet is optimal in that case.

As it turns out, it is actually difficult to create a dataset that causes a lot of branch mispredictions in selection sort. (It was pointed out by Yakk; see also my attempts to create an evil dataset; so far, I failed to create one.)

The -fprofile-arcs flag happens to disable tree vectorization, which seems to be responsible for generating the slow-case code. A better way to disable tree vectorization is to pass the -fno-tree-vectorize flag.

clang 3.4 also generates the fast case code, without any special flag. The Java code without warm up runs 2.4x slower than the C code. (Since it wasn't the question, I haven't looked into improving the Java code performance.)

Ali

Here are the results I get. For me (gcc 4.6.3) -O3 -funroll-loops wins.

> gcc -o s s.c                                               
> time ./s                                                   
Elapsed time: 13
./s  13.08s user 0.00s system 99% cpu 13.088 total

> gcc -o s s.c -O1                                           
> time ./s                                                   
Elapsed time: 16
./s  16.02s user 0.00s system 99% cpu 16.042 total

> gcc -o s s.c -O2                                           
> time ./s                                                   
Elapsed time: 16
./s  16.06s user 0.00s system 99% cpu 16.076 total

> gcc -o s s.c -O3                                           
> time ./s                                                   
Elapsed time: 7
./s  7.38s user 0.00s system 99% cpu 7.381 total

> gcc -o s s.c -O3 -funroll-loops                            
> time ./s                                                   
Elapsed time: 6
./s  6.04s user 0.00s system 99% cpu 6.046 total

(Note: "Elapsed time" line excludes the time spent in building the test array -- but it's negligible).

lbolla
  • On my PC, -O3 -funroll-loops results in ~18.6 secs, so it does not win for me... And still, I can't understand how it goes so fast while doing profiling... – marco6 Jan 10 '14 at 17:45

I get a 100% speedup (building with gcc -O2) in the C program if I remove the conditional from the swap function, i.e.:

static inline void swap(int* a, int i, int j) {
  int temp = a[i];
  a[i] = a[j];
  a[j] = temp;
}

The generated code is likely highly sensitive to branch prediction or cache prefetching, so even a slight difference in the generated code (as influenced, e.g., by different compiler flags) can have a huge impact here.

Note that the overhead of -fprofile-arcs in this program is small. Your own time measurements don't include writing out the profiling file either, but even writing out that data takes an insignificant amount of time compared to the 5 or 10+ seconds of execution time.

nos
  • `if (x != y) x = y;` type logic rarely helps for trivial assignments; I was thinking the same thing – Glenn Teitelbaum Jan 10 '14 at 19:13
  • Yes! On mine too! (from 18 to 10) While the Java version seems to have no benefit from this change... Maybe Java is just removing that at compile time. That would explain the difference in performance... Am I right? – marco6 Jan 11 '14 at 09:45
  • Still compiling with -O3 results in a 20s run... (??) Other tests: -O3 -fprofile-arcs: 12s -O2 -fprofile-arcs: 12s So as expected (with -O2) adding -fprofile-arcs does make my code slower, but the difference still remains if -O3 flag is turned on... – marco6 Jan 11 '14 at 09:51
  • You cannot get a 100% speedup; that would mean that you spend no time at all. 50% maybe. – VladP Dec 20 '17 at 22:30