1

I implemented and bench marked the following functions in c and java using the following code. For c, I'm getting about 1.688852 seconds while for java, it only takes like 0.355038 seconds. Even if I remove the sqrt function, inline the code manually or change function signature to accept 6 double coordinates (to avoid accessing via pointers) for c time lapsed doesn't improve much.

I'm compiling the c program like cc -O2 main.c -lm. For java, I'm running the application in intellij idea with the default jvm options (java 8, openjdk).

c:

#include <math.h>
#include <stdio.h>
#include <time.h>

typedef struct point3d
{
  double x;
  double y;
  double z;
} point3d_t;

double distance(point3d_t *from, point3d_t *to);

int main(int argc, char const *argv[])
{
  point3d_t from = {.x = 2.3, .y = 3.45, .z = 4.56};
  point3d_t to = {.x = 5.678, .y = 3.45, .z = -9.0781};

  double time = 0.0;
  int count = 10000000;

  for (size_t i = 0; i < count; i++)
  {
    clock_t tic = clock();
    double d = distance(&from, &to);
    clock_t toc = clock();
    time += ((double) (toc - tic) / CLOCKS_PER_SEC);
  }

  printf("Elapsed: %f seconds\n", time);

  return 0;
}

double distance(point3d_t *from, point3d_t *to)
{
  double dx = to->x - from->x;
  double dy = to->y - from->y;
  double dz = to->z - from->z;

  double d2 = (dx * dx) + (dy * dy) + (dz + dz);
  return sqrt(d2);
}

java:

public class App 
{
    static Random rnd = new Random();

    public static void main( String[] args )
    {
        var sw = new StopWatch();
        var time = 0.0;
        var count = 10000000;

        for (int i = 0; i < count; i++) {
            var from = Vector3D.of(rnd.nextDouble(), rnd.nextDouble(), rnd.nextDouble());
            var to = Vector3D.of(rnd.nextDouble(), rnd.nextDouble(), rnd.nextDouble());

            sw.start();
            var dist = distance(from, to);
            sw.stop();
            time += sw.getTime(TimeUnit.NANOSECONDS);
            sw.reset();
        }

        System.out.printf("Time: %f seconds\n", time / 1e09);
    }

    public static double distance(Vector3D from, Vector3D to) {
        var dx = to.getX() - from.getX();
        var dy = to.getY() - from.getY();
        var dz = to.getZ() - from.getZ();

        return Math.sqrt((dx * dx) + (dy * dy) + (dz * dz));
    }
}

My objective is to understand why the c program is slower and get it to work faster than the java program.

EDIT: I'm using random values in the java program to try and make sure jvm doesn't do anything funny like caching the result and sidestep computations altogether.

EDIT: Updating the two snippets for c to use clock_gettime(), to clock the time taken for all the loops rather than the method call and also to not toss the result from the method call:

#include <math.h>
#include <stdio.h>
#include <time.h>

typedef struct point3d
{
  double x;
  double y;
  double z;
} point3d_t;

double distance(point3d_t *from, point3d_t *to);

int main(int argc, char const *argv[])
{
  point3d_t from = {.x = 2.3, .y = 3.45, .z = 4.56};
  point3d_t to = {.x = 5.678, .y = 3.45, .z = -9.0781};

  struct timespec fs;
  struct timespec ts;

  long time = 0;
  int count = 10000000;
  double dist = 0;

  clock_gettime(CLOCK_REALTIME, &fs);

  for (size_t i = 0; i < count; i++)
  {
    dist = distance(&from, &to);
  }

  clock_gettime(CLOCK_REALTIME, &ts);
  time = ts.tv_nsec - fs.tv_nsec;

  if (dist == 0.001)
  {
    printf("hello\n");
  }

  printf("Elapsed: %f sec\n", (double) time / 1e9);

  return 0;
}

double distance(point3d_t *from, point3d_t *to)
{
  double dx = to->x - from->x;
  double dy = to->y - from->y;
  double dz = to->z - from->z;

  double d2 = (dx * dx) + (dy * dy) + (dz + dz);
  return sqrt(d2);
}

java:

public class App 
{
    static Random rnd = new Random();

    public static void main( String[] args )
    {
        var from = Vector3D.of(rnd.nextDouble(), rnd.nextDouble(), rnd.nextDouble());
        var to = Vector3D.of(rnd.nextDouble(), rnd.nextDouble(), rnd.nextDouble());

        var time = 0.0;
        var count = 10000000;
        double dist = 0.0;

        var start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            dist = distance(from, to);
        }

        var end = System.nanoTime();
        time = end - start;

        if (dist == rnd.nextDouble()) {
            System.out.printf("hello! %f\n", dist);
        }

        dist = dist + 1;
        System.out.printf("Time: %f sec\n", (double) time / 1e9);
        System.out.printf("Yohoo! %f\n", dist);
    }

    public static double distance(Vector3D from, Vector3D to) {
        var dx= to.getX() - from.getX();
        var dy = to.getY() - from.getY();
        var dz = to.getZ() - from.getZ();

        return Math.sqrt((dx * dx) + (dy * dy) + (dz * dz));
    }
}

Compiling c code using gcc -Wall -std=gnu99 -O2 main.c -lm. The results now are 0.06323 seconds for c code and 0.006325 seconds for java.

EDIT: As Jérôme Richard and Peter Cordes pointed out my bench marking is just flawed, not to mention I was taking the sqrt of a negative number in the c version. So, as soon as I compiled the c program with -fno-math-errno, it clocked 0 seconds. I compiled c program like gcc -O2 -std=gnu99 main.c -lm. Now the c program is clocking effectively zero seconds (271 ns) while java clocking 0.007217 seconds. Everything's in order :)

Below is the final code:

c:

#include <math.h>
#include <stdio.h>
#include <time.h>

typedef struct point3d
{
  double x;
  double y;
  double z;
} point3d_t;

double distance(point3d_t *from, point3d_t *to);

int main(int argc, char const *argv[])
{
  point3d_t from = {.x = 2.3, .y = 3.45, .z = 4.56};
  point3d_t to = {.x = 5.678, .y = 3.45, .z = -9.0781};

  struct timespec fs;
  struct timespec ts;

  long time = 0;
  int count = 10000000;
  double dist = 0;

  clock_gettime(CLOCK_REALTIME, &fs);

  for (size_t i = 0; i < count; i++)
  {
    dist = distance(&from, &to);
  }

  clock_gettime(CLOCK_REALTIME, &ts);
  time = ts.tv_nsec - fs.tv_nsec;

  printf("hello %f \n", dist);
  printf("Elapsed: %f ns\n", (double) time);
  printf("Elapsed: %f sec\n", (double) time / 1e9);

  return 0;
}

double distance(point3d_t *from, point3d_t *to)
{
  double dx = (to->x) - (from->x);
  double dy = (to->y) - (from->y);
  double dz = (to->z) - (from->z);

  double d2 = (dx * dx) + (dy * dy) + (dz * dz);
  return sqrt(d2);
}

java:

public class App 
{
    static Random rnd = new Random();

    public static void main( String[] args )
    {
        var from = Vector3D.of(2.3, 3.45, 4.56);
        var to = Vector3D.of(5.678, 3.45, -9.0781);

        var time = 0.0;
        var count = 10000000;
        double dist = 0.0;

        var start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            dist = distance(from, to);
        }

        var end = System.nanoTime();
        time = end - start;

        System.out.printf("Yohoo! %f\n", dist);
        System.out.printf("Time: %f ns\n", (double) time / 1e9);
    }

    public static double distance(Vector3D from, Vector3D to) {
        var dx = to.getX() - from.getX();
        var dy = to.getY() - from.getY();
        var dz = to.getZ() - from.getZ();

        var d2 =  (dx * dx) + (dy * dy) + (dz * dz);
        return Math.sqrt(d2);
    }
}
Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
kovac
  • 4,945
  • 9
  • 47
  • 90
  • 1
    In the C version, measure with [`clock_gettime()`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/clock_getres.html) rather than `clock()` ... resolution of `clock_gettime()` is much finer than of `clock()`... closer (equal?) to Java `StopWatch()`. Also make sure both clocks are measuting the same thing: wall-clock time, processor time, thread time, ... – pmg Oct 09 '21 at 15:20
  • You should use a profiler in C, not try to compute the time directly. gcc -Wall -pg ... – alinsoar Oct 09 '21 at 16:29

1 Answers1

5

First of all, the method used to measure the timings is very imprecise. The current methods introduce a huge bias that is probably bigger than the measure itself. Indeed, clock is not very precise on many platforms (about 1 ms on my machine and often not better than 1 us on almost all platforms). Moreover, the imprecision is strongly amplified by the 10,000,000 iterations. If you want to measure the loop precisely, you need to move the clock call outside the loop (and if possible use a more accurate measurement function).

Still, the main problem is that the function result is unused and the Java JIT can see that and partially optimize that. GCC cannot because of the math function standard behaviour (errno cause a side effect not available in the Java code). You can disable the use of errno using the command-line flag -fno-math-errno. With that GCC can now totally optimize the function (ie. remove the function call) and the resulting time is much smaller. However, the benchmark is flawed and you probably do not want to measure that. If you want to write a correct benchmark, you need to read the computed value. For example, you can compute a checksum, at least to check the results are correct/equivalent.

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
  • 3
    There's a canonical Q&A ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) that covers much of the flawed methodology you're talking about that isn't specific to the function under test (math-errno). – Peter Cordes Oct 09 '21 at 18:09
  • Thank you! I was searching for such a Q&A but failed to find something good enough. I am a bit surprised to see you on a Java-tagged question by the way ;) . – Jérôme Richard Oct 09 '21 at 19:47
  • I follow the [benchmarking] tag, mostly to spot performance questions that should have other tags, and to point out flawed methodology as the reason for surprising results. However, I've looked at the occasional Java JIT question to see how its asm compared to ahead-of-time compilers. – Peter Cordes Oct 09 '21 at 20:21
  • Thanks @JérômeRichard. I updated the code to clock time for all the loops, changed timer and also tried to use distance in code. Diff between timing improved but c is still about 10x slower. I'm guessing this is due to the flaws in time measurements? – kovac Oct 10 '21 at 07:35
  • 1
    @kovac: If you're still letting compilers / JITs optimize away an unused result, and not using `gcc -O2 -fno-math-errno`, then yeah unsurprising. You're maybe timing an empty loop in Java but not C. (And if you let them both optimize away the work, they'll be the same but it's not interesting.) `sqrtsd` has about 10x worse throughput than an empty loop on many modern x86 CPUs. For *exactly* 10x worse throughput, yes probably some CPU has a throughput of 10c for it, check https://agner.org/optimize/. Assuming you're compiling for x86-64 at all, not e.g. ARM. Or just name your CPU model. – Peter Cordes Oct 10 '21 at 07:39
  • @PeterCordes That's it! I initially tried with `-fno-math-errno` and time was effectively zero. Then I realised I was taking the sqrt of a negative number. So, I got rid of sqrt on both sides, and started printing distance (outside the timer) and all good now! Phew, I was starting to feel depressed. Thank you! Yeah, I'm using x86 intel cpu. – kovac Oct 10 '21 at 07:53
  • Even if the clocks had the same precision, the fact that the OP's benchmark is measuring time (twice) in each loop iteration is *probably* distorting the results. The chances are that some of the time measured is actually the time taken to read the clock value. So if C and Java take different times to read the clock (which seems likely) then **that** will also distort the measurements. Your solution is correct though. Hoist the time measurement out of the loop. (And fix the other issues you identified.) – Stephen C Oct 10 '21 at 08:11
  • Yep, the main issues were imprecise time measurements and more importantly, c code had a bug result in a negative value in the sqrt function which wasn't the case with java. Removing the errno during compilation helped me find that and once it's fixed, c is passing with flying colours :) – kovac Oct 10 '21 at 08:14