I've got two similar implementations (java and c++) for a trivial algorithm like the selection sort.
public interface SortingAlgorithm {
public void sort(int[] a);
}
public class SelectionSort implements SortingAlgorithm {
@Override
public void sort(int[] a) {
for (int i = 0; i < a.length; i++) {
int lowerElementIndex = i;
for (int j = i + 1; j < a.length; j++) {
if (a[j] < a[lowerElementIndex]) {
lowerElementIndex = j;
}
}
swap(a, lowerElementIndex, i);
}
}
private void swap(int[] a, int i, int j) {
if (i == j) {
return;
}
int temp = a[i];
a[i] = a[j];
a[j] = temp;
}
}
and the c one:
inline void swap(int* a, int i, int j);
void s_sort(int* a, int size) {
int i;
for (i = 0; i < size; i++) {
int lowerElementIndex = i, j;
for (j = i + 1; j < size; j++) {
if (a[j] < a[lowerElementIndex]) {
lowerElementIndex = j;
}
}
swap(a, lowerElementIndex, i);
}
}
inline void swap(int* a, int i, int j) {
if (i == j) {
return;
}
int temp = a[i];
a[i] = a[j];
a[j] = temp;
}
Now, I tried testing them on a large array (100000 random int). The results at first were java: ~17 sec (compiled and executed with oracle jdk/jvm) c: ~22 sec (compiled with gcc v4.8 without any optimization)
Of course, i then tried to optimize my c version through cflags. The results are(I'm reporting cflags only): -O1: ~18.4
-O2: ~18.4
-O{3-9}: ~20.9
Now, my first question is which cflacs should I use to compile?
So I read the gnu manual about optimizations. Adding -march=native didn't helped. After some time spent trying other options, I came into -fprofile-arcs option. Adding it to my flags made my code finish its test in about 11 seconds! However some files appeared in my folders: the results of the profiling. As I understand, I should use them with -fbranch-probabilities and recompile the code. Recompiling results again in ~18.5 sec. And this is what I really want to ask.
How is it possible for my program to run so fast if it has to write files and collect profiling information and instead it runs 1.5 times slower when it hasn't?
I forgot to mention that I'm on an old PC with a Intel Celeron @2.8GHz processor and linux (fedora 20 with xfce) installed. If you need other information about the hardware just ask! ;)
Edit: The code I use for the test is:
Java:
public class Test {
public static void main(String[] args) {
int[] a = new int[100000];
int[] a2 = new int[100000];
for (int i = 0; i < a.length; i++) {
a[i] = (int)(Math.random()*100000);
a2[i] = a[i];
}
SelectionSort s = new SelectionSort();
InsertionSort s1 = new InsertionSort();
double start = System.nanoTime();
s.sort(a);
double end = System.nanoTime();
double time = (end-start)/1000000000.0;
System.out.println("Selection: "+time);
start = System.nanoTime();
s1.sort(a2);
end = System.nanoTime();
time = (end-start)/1000000000.0;
System.out.println("Insertion: "+time);
}
}
And the c:
#include "insertion_sort.h"
#include "selection_sort.h"
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main() {
int max = 100000, i;
srand(time(NULL));
int array[100000], array2[100000];
for(i=0; i<100000; i+=1) {
array[i] = rand()%100000;
}
memcpy(array2, &array[0], 100000 * sizeof(int));
clock_t inizio = clock();
s_sort(array, max);
clock_t fine = clock();
float tempoEsecuzione = (float)(fine - inizio) / CLOCKS_PER_SEC;
printf("Selection: %2.3f\n", tempoEsecuzione);
inizio = clock();
i_sort(array2, max);
fine = clock();
tempoEsecuzione = (float)(fine - inizio) / CLOCKS_PER_SEC;
printf("Insertion: %2.3f\n", tempoEsecuzione);
return 0;
}
The code contains references to a insertion sort function that I haven't included in the rest of the question because (as expected) java run slower that c.