
I am comparing the time it takes Julia to compute the Euclidean distances between two sets of points in 3D space against an equivalent implementation in C. I was very surprised to observe that (for this particular case and my particular implementations) Julia is 22% faster than C. When I also included @fastmath in the Julia version, it was even 83% faster than C.

This leads to my question: why? Either Julia is more amazing than I originally thought or I am doing something very inefficient in C. I am betting my money on the latter.

Some particulars about the implementation:

  • In Julia I use 2D arrays of Float64.
  • In C I use dynamically allocated 1D arrays of double (see the indexing sketch after this list).
  • In C I use the sqrt function from math.h.
  • The computations are very fast, therefore I compute them 1000 times to avoid comparing on the micro/millisecond level.
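
For reference, here is a minimal sketch of the flat layout the C version uses; the IDX macro and the example values are purely illustrative and do not appear in the posted code.

#include <stdio.h>

/* Illustration only: component c (0, 1 or 2) of point i sits at flat index
   3*i + c, so the three coordinates of one point are adjacent in memory.
   In the Julia version the n-by-3 array is column-major, so x[i, 1],
   x[i, 2], x[i, 3] are n elements apart instead. */
#define IDX(i, c) (3*(i) + (c))

int main(void)
{
    double x[2*3] = {0.1, 0.2, 0.3,   /* point 0 */
                     0.4, 0.5, 0.6};  /* point 1 */

    printf("z component of point 1: %f\n", x[IDX(1, 2)]);  /* prints 0.600000 */
    return 0;
}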

Some particulars about the compilation:

  • Compiler: gcc 5.4.0
  • Optimisation flags: -O3 -ffast-math

Timings:

  • Julia (without @fastmath): 90 s
  • Julia (with @fastmath): 20 s
  • C: 116 s
  • I use the bash command time for the timings, so the whole program is measured (a sketch for timing only the kernel in C follows this list)
    • $ time ./particleDistance.jl (with shebang in file)
    • $ time ./particleDistance
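
Note that time measures the whole process, so startup and the calls to rand() are included (and, for Julia, compilation as well). The sketch below shows one way to time only the kernel in C with clock_gettime; the kernel() function is just a stand-in workload and none of this appears in the posted code (it assumes a POSIX system; compile with -lm).

#include <stdio.h>
#include <time.h>
#include <math.h>

/* Stand-in workload; in the real program this would be the distance() call. */
static double kernel(int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += sqrt((double) i);
    return s;
}

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    double s = kernel(10000000);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Elapsed wall-clock time for the kernel only. */
    double dt = (t1.tv_sec - t0.tv_sec) + 1e-9*(t1.tv_nsec - t0.tv_nsec);
    printf("kernel time: %f s (s = %f)\n", dt, s);
    return 0;
}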

particleDistance.jl

#!/usr/local/bin/julia

function distance!(x::Array{Float64, 2}, y::Array{Float64, 2}, r::Array{Float64, 2})
    nx = size(x, 1)
    ny = size(y, 1)

    for k = 1:1000

        for j = 1:ny

            @fastmath for i = 1:nx
                @inbounds dx = y[j, 1] - x[i, 1]
                @inbounds dy = y[j, 2] - x[i, 2]
                @inbounds dz = y[j, 3] - x[i, 3]

                rSq = dx*dx + dy*dy + dz*dz

                @inbounds r[i, j] = sqrt(rSq)
            end

        end

    end

end

function main()
    n = 4096
    m = 4096

    x = rand(n, 3)
    y = rand(m, 3)
    r = zeros(n, m)

    distance!(x, y, r)

    println("r[n, m] = $(r[n, m])")
end

main()

particleDistance.c

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

void distance(int n, int m, double* x, double* y, double* r)
{
    int i, j, I, J;
    double dx, dy, dz, rSq;

    for (int k = 0; k < 1000; k++)
    {
        for (j = 0; j < m; j++)
        {
            J = 3*j;

            for (i = 0; i < n; i++)
            {
                I = 3*i;

                dx = y[J] - x[I];
                dy = y[J+1] - x[I+1];
                dz = y[J+2] - x[I+2];

                rSq = dx*dx + dy*dy + dz*dz;

                r[j*n+i] = sqrt(rSq);
            }
        }
    }
}

int main()
{
    int i;
    int n = 4096;
    int m = 4096;

    double *x, *y, *r;

    size_t xbytes = 3*n*sizeof(double);
    size_t ybytes = 3*m*sizeof(double);

    x = (double*) malloc(xbytes);
    y = (double*) malloc(ybytes);
    r = (double*) malloc(n*m*sizeof(double));  /* n*m distances */

    for (i = 0; i < 3*n; i++)
    {
        x[i] = (double) rand()/RAND_MAX*2.0-1.0;
    }

    for (i = 0; i < 3*m; i++)
    {
        y[i] = (double) rand()/RAND_MAX*2.0-1.0;
    }

    distance(n, m, x, y, r);

    printf("r[n*m-1] = %f\n", r[n*m-1]);

    free(x);
    free(y);
    free(r);

    return 0;
}

Makefile

all: particleDistance.c
    gcc -o particleDistance particleDistance.c -O3 -ffast-math -lm
Numaerius
  • How and where did you measure your timings? I think the function `rand()` has high costs. – Tom Kuschel Jul 16 '17 at 06:52
  • I timed from the command line, so I am timing the entire program, not just a part of it. So my timings do include the time spent on `rand()` (for both Julia and C). I also did the timing after commenting out the `rand()` part. The timing is then somewhat less, but Julia is still considerably faster. – Numaerius Jul 16 '17 at 07:07
  • Probably you linked with soft math in C and Julia uses fpu – 0___________ Jul 16 '17 at 07:21
  • @PeterJ, sorry for my lack of knowledge, but what are *soft math* and *fpu*? – Numaerius Jul 16 '17 at 07:27
  • In the C code there are more instructions too (apart from the `sqrt()` function). The calculation of `r[j*n+i]` and also the line `I = 3*i;` cost because of the 1000*4k*4k = 16 billion iterations. You can test this behaviour by replacing the sqrt with a function that does nothing. – Tom Kuschel Jul 16 '17 at 07:27
  • How does the timing change if you replace `r[j*n+i] = sqrt(rSq);` with `r[j*n+i] = rSq;`? Please show us the assembler code the compiler produced. – user5329483 Jul 16 '17 at 08:15
  • @mtgoncalves https://stackoverflow.com/questions/3321468/whats-the-difference-between-hard-and-soft-floating-point-numbers – 0___________ Jul 16 '17 at 09:38
  • @user5329483 if I remove the call to `sqrt()` in C, the timing drops from 116 to 67 s. Still way slower than Julia. @roygvib Nice catch! Unfortunately, it did nothing for the timing. – Numaerius Jul 16 '17 at 10:04
  • I do not know about the Julia data model, but there is one performance problem in your C code. The C compiler cannot optimize access to the values `y[J]`, `y[J+1]` and `y[J+2]` by loading them before the inner loop and reusing them. This is because, theoretically, `r` can be a pointer overlapping `y`, and when you assign to `r[j*n+i]` you can change `y[J]` for the next loop iteration. To solve it, you can copy those values into intermediary variables (like `y0=y[J]; y1=y[J+1]; y2=y[J+2];`) before the loop and use `y0,y1,y2` inside the loop, or use the qualifier `restrict` on `r` (a sketch of this follows the comment list). – Marian Jul 16 '17 at 12:34
  • Not sure why this is surprising. Type-stable Julia is essentially clang-compiled C. Did you try compiling with clang and see if this is just a case where it optimizes better? – Chris Rackauckas Jul 16 '17 at 15:22
  • @ChrisRackauckas compiling with clang made no difference. – Numaerius Jul 16 '17 at 18:21
  • Interesting. Many times it does. In that case it sounds like Julia is just generating better code than you, but with enough work you can find out how. Check @code_llvm for the IR – Chris Rackauckas Jul 16 '17 at 18:22
  • @Marian using `y[0]`, `y[1]` and `y[2]` and `restrict` did not do anything for the timing :( – Numaerius Jul 16 '17 at 18:47
  • In general, this code in Julia should be of similar performance to equivalent C code. However, measuring performance from the command line for dynamic languages like Julia is incorrect; you are primarily measuring the compilation time. Having said that, I think the primary cause of C being slow is that you are calling `rand()` within a loop in C, while Julia uses a much more efficient method that generates multiple random numbers in one go. I would guess that if you change the Julia code to call a scalar `rand` in a loop, it will become much slower. – aviks Jul 16 '17 at 22:48
  • @aviks Even when I commented out the `rand()` part in the C code, the C code was still slower than the Julia code **with** the `rand()` still in it. So I doubt that the slowness is due to `rand()`. – Numaerius Jul 17 '17 at 22:06
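
A minimal sketch of Marian's suggestion, with the y components hoisted out of the inner loop and r marked restrict (untested; the const on x and y is an addition, otherwise the names follow the posted C code):

#include <math.h>

/* Sketch: hoist the y components of the current point out of the inner loop
   and promise the compiler that r does not alias x or y. */
void distance(int n, int m, const double* x, const double* y, double* restrict r)
{
    for (int k = 0; k < 1000; k++)
    {
        for (int j = 0; j < m; j++)
        {
            double y0 = y[3*j], y1 = y[3*j + 1], y2 = y[3*j + 2];

            for (int i = 0; i < n; i++)
            {
                double dx = y0 - x[3*i];
                double dy = y1 - x[3*i + 1];
                double dz = y2 - x[3*i + 2];

                r[j*n + i] = sqrt(dx*dx + dy*dy + dz*dz);
            }
        }
    }
}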

2 Answers


Maybe this should be a comment, but the point is that Julia is indeed pretty optimized. On the Julia web page you can see that it can beat C in some cases (mandel).

I see that you are using -ffast-math in your compilation, but maybe you could still do some optimizations in your code (although nowadays compilers are pretty smart and this might not solve the issue). A rough sketch combining a few of these points follows the list.

  1. Instead of using int for your indexes, try unsigned int; this enables the next point.
  2. Instead of multiplying by 3, with an unsigned index you can do a shift and an add. This can save some computation time.
  3. When accessing elements like x[I], maybe try using pointers directly and walk through the elements sequentially, e.g. x += 3.
  4. Instead of passing int n and int m, try defining them as macros. If they are known in advance, the compiler can take advantage of that.
  5. Does the malloc make a difference in this case? If n and m are known, fixed-size arrays would avoid the time spent asking the OS to allocate memory.
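
A rough sketch of points 1, 2, 4 and 5 combined; the function name distance_fixed, the macro sizes and the omission of the 1000x repeat loop are illustrative choices, and whether any of this beats what -O3 already generates is compiler-dependent.

#include <math.h>

#define N 4096   /* sizes known at compile time instead of passed as int */
#define M 4096

static double x[3*N], y[3*M], r[N*M];   /* fixed-size arrays, no malloc */

void distance_fixed(void)
{
    for (unsigned j = 0; j < M; j++)
    {
        unsigned J = (j << 1) + j;          /* 3*j as a shift and an add */

        for (unsigned i = 0; i < N; i++)
        {
            unsigned I = (i << 1) + i;      /* 3*i as a shift and an add */

            double dx = y[J]     - x[I];
            double dy = y[J + 1] - x[I + 1];
            double dz = y[J + 2] - x[I + 2];

            r[j*N + i] = sqrt(dx*dx + dy*dy + dz*dz);
        }
    }
}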

There might be a few other things, but Julia is pretty optimized thanks to its just-in-time compilation, so everything that is constant and known in advance is used in its favor. I have tried Julia with no regrets.


Your index calculation in C is rather slow

Try something like the following (I did not compile it, it may still have errors; it is just to visualize the idea):

void distance(int n, int m, double* x, double* y, double* r)
{
    int i, j;
    double dx, dy, dz, rSq;
    double *X, *Y, *R;

    for (int k = 0; k < 1000; k++)
    {
        R = r;
        Y = y;

        for (j = 0; j < m; j++)
        {
            X = x;

            for (i = 0; i < n; i++)
            {
                dx = Y[0] - *X++;
                dy = Y[1] - *X++;
                dz = Y[2] - *X++;

                rSq = dx*dx + dy*dy + dz*dz;

                *R++ = sqrt(rSq);
            }

            Y += 3;
        }
    }
}

Alternatively you could try the following; it might be a little bit faster (one increment instead of three):

                dx = Y[0] - X[0];
                dy = Y[1] - X[1];
                dz = Y[2] - X[2];
                X += 3;

Y[x] is the same as *(Y+x).

Good luck

stefan bachert