0

Platform: OpenBSD, compiler: gcc, javac (OpenJDK version 17)

I have made a few sorting benchmarks in various languages, and I'm rather surprised by the performance of the Java program over the C program.

I have programmed the exact same sorting algorithms in both languages, and the Java program finishes almost twice as fast, all other languages are slower than the C implementation except the Java one.

The benchmarks involve running the sorting algorithm on a random array of numbers a set number of times.

I am compiling the program with -O3 and -Ofast, so I cannot apply any more compiler optimizations.

The exact code can be found here, but here is an excerpt from it:

Java:

public static void benchmark(SortingFunction func, int arraySize, int numTimes, String name, BufferedOutputStream bo) throws IOException {
    int[][] arrs = new int[numTimes][arraySize];
    for (int i = 0; i < numTimes; i ++) {
      arrs[i] = genRandArray(arraySize);
    }
    long start = System.nanoTime();
    for (int i = 0; i < numTimes; i ++) {
      func.sort(arrs[i]);
    }
    long end = System.nanoTime();
    double time = (double)(end - start) / 1e9;
    System.out.println("It took " + time + " seconds to do " + name + " sort " +
            numTimes + " times on arrays of size " + arraySize
    );
    String out = name+","+numTimes+","+arraySize+","+time;
    for (char c : out.toCharArray()) {
      bo.write(c);
    }
    bo.write('\n');
  }
public static void insertionSort(int[] array) {
    for (int i = 1; i < array.length; i ++) {
      int temp = array[i];
      int j;
      for (j = i - 1; j >= 0 && array[j] > temp; j --) {
        array[j+1] = array[j];
      }
      array[j+1] = temp;
    }
  }

C:

void benchmark(void (*f)(int *, int), int arr_size, int num_times, char *name,
               FILE *fp) {
  int *arrs[num_times];
  struct timeval start, end;
  double t;
  for (int i = 0; i < num_times; i++) {
    arrs[i] = gen_rand_arr(arr_size);
  }
  gettimeofday(&start, NULL);
  for (int i = 0; i < num_times; i++) {
    f(arrs[i], arr_size);
  }
  gettimeofday(&end, NULL);
  for (int i = 0; i < num_times; i++) {
    free(arrs[i]);
  }
  t = ((double)(end.tv_sec * 1000000 + end.tv_usec -
                (start.tv_sec * 1000000 + start.tv_usec))) /
      1000000;
  printf("It took %f seconds to do %s sort %d times on arrays of size %d\n", t,
         name, num_times, arr_size);

  if (fp != NULL) {
    fprintf(fp, "%s,%d,%d,%f\n", name, num_times, arr_size, t);
  }
}
void insertion_sort(int *arr, int arr_size) {
  for (int i = 1; i < arr_size; i++) {
    int temp = arr[i];
    int j;
    for (j = i - 1; j >= 0 && *(arr + j) > temp; j--) {
      arr[j + 1] = arr[j];
    }
    arr[j + 1] = temp;
  }
  return;
}

Are there some optimizations that Java is making to run faster that somehow change the algorithm? What is going on here?

Any explanations would be appreciated.

Here is a table of results that might help explain the difference:

Java:

name rep size time
Insertion 10000 1200 1.033
Insertion 10000 5000 15.677
Insertion 10000 12000 88.190
Selection 10000 1200 3.118
Selection 10000 5000 48.377
Selection 10000 12000 268.608
Radix 10000 1200 0.385
Radix 10000 5000 1.491
Radix 10000 12000 3.563
Bogo 1 11 1.330
Bogo 1 12 0.572
Bogo 1 13 122.777

C:

name rep size time
Insertion 10000 1200 1.766
Insertion 10000 5000 26.106
Insertion 10000 12000 140.582
Selection 10000 1200 4.011
Selection 10000 5000 60.442
Selection 10000 12000 340.608
Radix 10000 1200 0.430
Radix 10000 5000 1.788
Radix 10000 12000 4.295
Bogo 1 11 1.378
Bogo 1 12 2.296
Bogo 1 13 1586.73

Edit:

I modified the benchmarking function to generate the arrays beforehand, in Java it overflows the heap, and in C it makes it not much faster (around 25%, but Java is still faster).

fwiw I also changed how I indexed things in C from *(arr + i) to arr[i].

Here's the implementation for the random array generator in Java and C:

Java:

  public static int[] genRandArray(int arraySize) {
    int[] ret = new int[arraySize];
    Random rand = new Random();
    for (int i = 0; i < ret.length; i ++) {
      ret[i] = rand.nextInt(100);
    }
    return ret;
  }

C:

int *gen_rand_arr(int arr_size) {
  int *arr;
  if ((arr = malloc(arr_size * sizeof(int))) == NULL) {
    exit(1);
  }
  for (int i = 0; i < arr_size; i++) {
    arr[i] = arc4random_uniform(100);
  }
  return arr;
}
  • 8
    Do they sort the same data or is it a knock-on of the PRNG implementation? Try loading the same data from a prepared file (and check that the sorting worked too). More seriously, you seem to be including the array allocation and generation in the timing. – Weather Vane Dec 13 '21 at 12:17
  • You're likely benchmarking the RNG instead of your implementation. – Andrew Henle Dec 13 '21 at 12:26
  • please add implementation of `genRandArray` for both C and Java – tstanisl Dec 13 '21 at 13:38
  • 2
    @AndrewHenle, the timing scales with O(N^2), therefore the execution time is likely dominated by sorting (at least for Selection and Insertion sort). – tstanisl Dec 13 '21 at 13:41
  • 1
    don't use `*(arr + i)` in C. Simply use `arr[i]` as in Java – tstanisl Dec 13 '21 at 13:50
  • Use `size_t` for array indexes (and associated variables) in the C version? – pmg Dec 13 '21 at 17:16
  • 1
    Don't just time it until you know the code is optimal. The way you make it optimal is to remove activities that unnecessarily and unpredictably cost a large fraction of time. You don't do that by eyeballing the code or counting on the compiler optimizer. You do it by randomly halting it under the debugger to see what it's doing. Do this on *unoptimized* code. After you've made the unoptimized code as fast as possible, then turn on the compiler's optimizer. – Mike Dunlavey Dec 13 '21 at 17:32
  • **Memory allocation and free of C is slower than java's garbage collection** (though java uses threads!). And different algorithms for random number generation is a large unknown. – Joop Eggen Dec 13 '21 at 17:35
  • The 'slow' c code is actually C++, which is NOT fast – user3629249 Dec 13 '21 at 20:18
  • 1
    and you also need to do a proper Java microbenchmark which is difficult: [How do I write a correct micro-benchmark in Java?](https://stackoverflow.com/q/504103/995714) – phuclv Dec 14 '21 at 04:55
  • What CPU exactly? What version of what C compiler? What JVM version? – Peter Cordes Dec 15 '21 at 08:55
  • Sigh... not this again. Post what system, CPU, compiler, compiler options and so on that you are using. – Lundin Dec 15 '21 at 12:22

1 Answers1

2

TL;DR

In general, short snippets like this are not a fair way to compare languages. There are a lot of factors that comes into play. Code does not automatically get faster when you write it in C instead of Java. If that were the case, you could just write a Java2C converter. Compiler flags matters a lot, but also the skill of the programmer.

Longer explanation

I cannot say for sure, but this:

for (j = i - 1; j >= 0 && arr[j] > temp; j--) {
  arr[j + 1] = arr[j];
}

is not very cache friendly, because you're traversing the list backwards. I would try changing the loop so that the outer loop do the backwards traversing instead of the inner loop.

But I'd say that your question is fundamentally flawed. Code does not automatically get a performance boost just because you rewrite it from Java to C. In the same way, C programs does not automatically get faster because you rewrite them to assembly. One could say that C allows you to write faster programs than Java, but in the end, the result depend on the programmer.

One thing that can speed up Java programs is the JIT compiler, which can do statistics to optimize the code during runtime for the specific conditions right there and then. While it is possible to make a C compiler to optimize for typical workload, it cannot optimize for current workload.

You said that you used -O3 for the C code. But what target did you use? Did you optimize for your machine or a general one? The JIT compiler knows the target to optimize for. Try using -march=native

Are you sure that you're using the same size for int? It's 32 bit in Java, but might be 64 in C. It might speed up the C code if you switch to int32_t instead. But it might also slow it down. (Very unlikely that this is the cause, but I just wanted to mention it as a possibility)

C usually shines when it comes to very low level stuff.

And if we look in your Java code:

for (int i = 1; i < array.length; i ++) {
      int temp = array[i];

In this example, a smart compiler can easily see that array will never be accessed out of bounds. But what if we instead would have something like:

while(<condition>) {
      int temp = array[foo()];

where it cannot be determined beforehand that array will not go out of bounds? Then Java is forced to do constant boundary checking to be able to throw exceptions. The code would be translated to something like:

while(<condition>) {
      int i = foo();
      if(i >= array.length)
           throw exception;
      int temp = array[i];

This gives security, but costs performance. C would simply allow you to access out of bounds, which is faster but may cause bugs that are hard to find.

I found a nice question with more info: Why would it ever be possible for Java to be faster than C++?

Apart from that, I can see that you're including the data generation in the benchmark. That's very bad. Generate the data before starting the timer. Like this:

int *arrs[num_times];

for (int i = 0; i < num_times; i++) 
    arrs[i] = gen_rand_arr(arr_size);

gettimeofday(&start, NULL);

for (int i = 0; i < num_times; i++) 
    f(arrs[i], arr_size);

gettimeofday(&end, NULL);
klutt
  • 30,332
  • 17
  • 55
  • 95
  • 1
    The proposal won't have any sorting to do on subsequent iterations. – Weather Vane Dec 13 '21 at 12:34
  • @WeatherVane Huh? Explain – klutt Dec 13 '21 at 12:37
  • Oh, you've made an array of arrays to use each time. – Weather Vane Dec 13 '21 at 13:02
  • @WeatherVane Technically, it's an array of pointers, but yes, that's what I did. ;) – klutt Dec 13 '21 at 13:33
  • I implemented this and it has not changed the time difference between C and Java, though it did make both faster. – lightningx10 Dec 13 '21 at 17:09
  • 1
    @lightningx10 I'm a bit curious here. Did you test only the first one alone? That would be very interesting, because I was only speculating. – klutt Dec 13 '21 at 21:02
  • @klutt I changed only the second tip with the arrays generated beforehand. The performance difference is not just in this sorting algorithm but in any n^2 sorting algorithm that I give either language, which seems strange as C is typically faster at this kind of stuff. – lightningx10 Dec 15 '21 at 03:56
  • *It's 32 bit in Java, but might be 64 in C.* There are no mainstream C implementations for x86 that use 64-bit `int`. Or for other modern mainstream CPUs. I hope the querent would have told us if they were compiling for ancient Cray hardware. (And likely would have mentioned if they were using something other than an x86 PC running Windows; such folks are the ones who are most likely to be oblivious to the existence of other kinds of computers. Although the `-O3` / `-Ofast` means gcc or clang, so I'm less confident about my Windows guess; could easily be Linux, maybe even MacOS.) – Peter Cordes Dec 15 '21 at 08:55
  • Your other points are sensible, though, e.g. `-march=native` vs. GCC/clang's default of `-mtune=generic`. – Peter Cordes Dec 15 '21 at 09:01
  • @PeterCordes Yes, it's true about 32/64. It was basically just speculating to illustrate that small details sometimes may have huge effects, and that it's pretty tricky to do a fair comparison. – klutt Dec 15 '21 at 09:03
  • Sure, but that specific example is unfortunately not great. You could have pointed out that using `int i` instead of `size_t i` or `intptr_t i` might be making a compiler redo sign-extension of a narrow index inside the loop, since that's something that can happen on modern x86-64 systems. Or AArch64, but AArch64 has an addressing mode that uses a 32-bit integer with a 64-bit pointer for free, IIRC. (In practice GCC and clang are pretty good about taking advantage of signed-overflow being UB to widen loop counters used as indices up to pointer width, so it shouldn't hurt the loop.) – Peter Cordes Dec 15 '21 at 09:34
  • And of course it's always possible that a specific missed-optimization is tripping up a (micro) benchmark that spends all its time in one small loop. Or even microarchitectural effects, e.g. the Skylake JCC-erratum mitigation that can hurt those CPUs depending on code alignment relative to a 32-byte boundary. As in [Why does my empty loop run twice as fast if called as a function, on Intel Skylake CPUs?](https://stackoverflow.com/a/67878812) – Peter Cordes Dec 15 '21 at 09:37
  • @PeterCordes I have already edited that paragraph – klutt Dec 15 '21 at 09:37
  • @PeterCordes Also, feel free to add an answer :) – klutt Dec 15 '21 at 09:38
  • I'd consider answering if / when the querent includes any details on CPU / compilers / versions, so I'd know what asm to look at. (Probably actually need disassembly to look for JCC-erratum pitfalls; a difference in linking could change alignment relative to a 32B boundary, so reproing the asm on https://godbolt.org/ wouldn't be sufficient. The link to source code on gitlab is totally insufficient without any compiler / CPU details, since it's the same logic so the devil is in exactly how it compiled.) – Peter Cordes Dec 15 '21 at 09:39