
I just started with Julia and translated my MATLAB code into Julia (basically line by line). I noticed that the Julia code is much slower (like 50x). The original problem is a dynamic programming problem in which I interpolate the value function -- that interpolation is where the code spends most of its time. So I tried to make a minimal example showing the performance difference. The important things to note are that it is a spline approximation for the interpolation and that the grid is preferably irregular, i.e. not equally spaced. The MATLAB code:

tic
spacing=1.5;
Nxx = 300; 
Naa = 350;
Nalal = 200; 
sigma = 10;
NoIter = 500;

xx=NaN(Nxx,1);
xmin = 0.01;
xmax = 400;
xx(1) = xmin;
for i=2:Nxx
    xx(i) = xx(i-1) + (xmax-xx(i-1))/((Nxx-i+1)^spacing);
end

f_U = @(c) c.^(1-sigma)/(1-sigma);  
W=NaN(Nxx,1);
W(:,1) = f_U(xx);

xprime = ones(Nalal,Naa);
for i=1:NoIter
     W_temp = interp1(xx,W(:,1),xprime,'spline');
end
toc

Elapsed time is 0.242288 seconds.

The Julia code:

using Dierckx
function performance()

const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
for i=1:NoIter
    for j=1:Naa
        for k=1:Nalal
            W_temp[k,j] = evaluate(interp_spline, xprime[k,j])
        end
    end
end

end

@time(performance())
4.200732 seconds (70.02 M allocations: 5.217 GB, 4.35% gc time)

I made W a 2-d array because in the original problem it is a matrix. I did some research on different interpolation packages, but there weren't many options for irregular grids and splines. MATLAB's interp1 is apparently not available.
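For example, Interpolations.jl seems to handle irregular ("gridded") coordinates, but as far as I can tell only with linear interpolation, not cubic splines. A minimal sketch of what I mean, based on my reading of its README (so treat the exact API as my assumption, untested):

using Interpolations

xg = [0.01, 0.5, 1.7, 3.0, 10.0]                 # irregular grid
yg = log(xg)                                     # values on the grid
itp = interpolate((xg,), yg, Gridded(Linear()))  # linear only on irregular grids
itp[2.0]                                         # evaluate at a query point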

My problem is that I was thinking: if I write Julia code and it gives the same result as MATLAB, then Julia should generally be faster. Apparently that is not the case, so you need to pay SOME attention to your coding. I am not a programmer; of course I try to do my best, but I would like to know whether I am making some obvious mistake here that can easily be fixed, or whether I will (too) often have to pay this kind of attention to my Julia coding -- because then it might not be worth it for me to learn the language. In the same vein, if I can make Julia faster here (which I am pretty sure I can, e.g. the allocations look a bit large), I could probably also make MATLAB faster. My hope with Julia was that -- for similar code -- it would run faster than MATLAB.

After some comments about the timing, I also ran this code:

using Dierckx

tic()
const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
     xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
for i=1:NoIter
    for j=1:Naa
        for k=1:Nalal
            W_temp[k,j] = evaluate(interp_spline, xprime[k,j])
        end
    end
end
toc()

elapsed time: 7.336371495 seconds

Even much slower, uhm...
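As the commenters point out, this version times compilation as well, and everything runs in global scope, which the Julia performance tips warn against. The usual pattern is to wrap the work in a function, call it once so it compiles, and then time a second call. A minimal sketch (the function is just illustrative):

function work()
    s = 0.0
    for i = 1:10^7
        s += sqrt(i)
    end
    return s
end

work()         # first call includes compilation
@time work()   # second call measures execution only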

Another edit: eliminating one loop actually makes it faster in this case, still not comparable to MATLAB though. The code:

function performance2()

const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
for i=1:NoIter
    for j=1:Naa
        W_temp[:,j] = evaluate(interp_spline, xprime[:,j])
    end
end

end

@time(performance2())
1.573347 seconds (700.04 k allocations: 620.643 MB, 1.08% gc time)

Another edit, now iterating the same number of times through the loop:

function performance3()

const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)
W_tempr = Array(Float64, Nalal*Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
xprimer = reshape(xprime, Nalal*Naa)

for i=1:NoIter
    W_tempr = evaluate(interp_spline, xprimer)
end

W_temp = reshape(W_tempr, Nalal, Naa)
end

tic()
performance3()
toc()

elapsed time: 1.480349334 seconds

Still not at all comparable with MATLAB. By the way, in my actual problem I easily run the loop 50k times, and I am using bigger xprime matrices, although I am not sure whether that part makes a difference.
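One aside: reshape in Julia shares the underlying data instead of copying, so the reshape round-trip in performance3 should be essentially free. A quick check:

A = ones(2,3)
v = reshape(A, 6)
v[1] = 99.0
A[1,1]   # gives 99.0 -- v and A share the same memory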

marky2k
  • In MATLAB you aren't using anything fancy besides `interp1`, so my best guess is that that function is highly optimised. Can't say anything about Julia though, as I don't speak it. – Adriaan Jan 13 '16 at 11:59
  • I see a triple loop in the julia code, and only a single loop in matlab. That raises a flag:) Usually avoiding loops in high-level high-performance languages is a good thing. – Andras Deak -- Слава Україні Jan 13 '16 at 12:01
  • Julia is not slow, maybe it is the first time you are calling all the libraries and functions? Try running the same code a second time and see if it speeds up. Julia is different from Matlab in that its compiler can compile while you are working on the code (you highlight it and you can compile highlighted sections from your code). This requires everything to be in memory, which means that it will run very slowly in the first call to any function/library. – GameOfThrows Jan 13 '16 at 12:02
  • One of the obvious mistakes is the fact that your Julia code times both the compilation and the execution of the code. –  Jan 13 '16 at 12:04
  • It wasn't the first time that I called the library and the function; I ran it several times. I am not sure whether that's what some here want, but I edited the main post and ran code that _directly_ compares with MATLAB. Concerning the for loops, they are supposed to be very fast and users are advised to devectorize, see here http://julialang.org/blog/2013/09/fast-numeric/ – marky2k Jan 13 '16 at 12:43
  • @marky2k thanks for the link, interesting feature. I'll just join Adriaan in not making further assumptions about Julia:) – Andras Deak -- Слава Україні Jan 13 '16 at 13:01
  • @marky2k Sure, but what you said doesn't necessarily apply to your inner loop. You `evaluate` the spline; if that function call (not expression, mind that) is slow, your entire loop is slow. I cannot see the code for the function (don't have access to github). Side comment, that's not how you create an anonymous function; how about `f_U = c -> c.^(1-sigma)/(1-sigma)` –  Jan 13 '16 at 13:02
  • hmm... the key point is the triple loop, you just need `reshape` ;) – Gnimuc Jan 13 '16 at 13:57
  • also `evaluate()` is slower than `interp1()`, but not that much, check out [this](https://gist.github.com/Gnimuc/e01cd9423bb0a0f66c24) – Gnimuc Jan 13 '16 at 14:20
  • The runtimes are still not comparable because you are evaluating the spline `NoIter` times with Matlab, and `NoIter * Naa` times with Julia. Eliminating one loop made it faster because you did less work. Make the algorithms the same before drawing any conclusions. – Isaiah Norton Jan 13 '16 at 14:51
  • the Julia example is implementation-wise identical to the Matlab one if you replace the two loops inside the `NoIter` loop with just `W_temp = evaluate(interp_spline, xprimer)`, where `xprimer` is the `xprime` matrix reshaped to a vector, and you just reshape `W_temp` back to a matrix afterwards. On my machine this is still approx. 4 times slower than Matlab, so like @GnimucK. said, `evaluate()` indeed seems quite a bit slower than `interp1()` – luffe Jan 13 '16 at 14:55
  • This is another example where Matlab teaches you anything but a programming language. – percusse Jan 13 '16 at 15:02
  • The julia package [ProfileView](https://github.com/timholy/ProfileView.jl) is handy in these situations. It'll show you *exactly* which component of your Julia program is the bottleneck. Matlab has an equivalent routine, although I can't remember what it is called off the top of my head. Using both these tools should pinpoint the issue. – Colin T Bowers Jan 13 '16 at 23:26
  • Way too much guessing folks! – HarmonicaMuse Jan 14 '16 at 04:40
  • @marky2k Do you have any information on whether your Matlab code uses multiple cores? (i.e., is interp1() multithreaded by default?) – roygvib Jan 14 '16 at 12:27
  • @roygvib I don't think so. I don't have the parallel toolbox... On a side note, it seems like the MATLAB code could be made faster by avoiding interp1: http://www.mathworks.com/matlabcentral/answers/44346-1d-interpolation-on-array-without-using-loops – marky2k Jan 14 '16 at 14:31
  • Possible duplicate of [Why does my julia code run so slowly?](http://stackoverflow.com/questions/34615746/why-does-my-julia-code-run-so-slowly) – HarmonicaMuse Jan 14 '16 at 14:48
  • @marky2k Because I have seen these pages http://www.walkingrandomly.com/?p=1894 http://www.mathworks.com/matlabcentral/answers/95958-which-matlab-functions-benefit-from-multithreaded-computation I was wondering if interp1() uses multiple threads. (Now I'm trying to parallelize the Fortran code, so if possible I want to make the comparison with the same number of threads...) – roygvib Jan 14 '16 at 14:50
  • I would recommend you "rephrase your question" and post a [code review](http://codereview.stackexchange.com/) – Felipe Lema Jan 14 '16 at 14:58
  • @roygvib it seems `interp1()` is multithreaded. If I take the average time of `500` `evaluate(...)` calls in Julia, this turns out to be almost exactly `4` times slower than the equivalent `interp1()` in Matlab, on my dual core laptop with hyperthreading. – luffe Jan 14 '16 at 15:19
  • so it seems the takeaway here (ignoring the slightly messy original question) is that the Julia time is almost exactly `NUMBER_OF_CORES` times the Matlab time - suggesting that `interp1()` is implemented with multithreading...? – luffe Jan 14 '16 at 15:36
  • @luffe Thanks for your info. Yeah, I suspected the same, but in fact it may simply be that interp1() is really highly optimized even for a single thread... so not sure :) – roygvib Jan 14 '16 at 16:52
  • Have a look here - [Benchmark MATLAB & Julia for Matrix Operations](https://github.com/RoyiAvital/MatlabJuliaMatrixOperationsBenchmark). Saw this post on [Julia Tips and Tricks from Stackoverflow](https://github.com/Gnimuc/JuliaSO). – Royi Feb 13 '17 at 14:46

1 Answer


Because I'm also learning Julia, I tried to speed up the OP's code (for my own practice!). It seems that the bottleneck is essentially the underlying Fortran code. To verify this, I first rewrote the OP's code into a minimal form as follows:

using Dierckx

function perf()

    Nx = 300 

    xinp = Float64[ 2pi * i / Nx for i = 1:Nx ]
    yinp = sin( xinp )

    interp = Spline1D( xinp, yinp )

    Nsample = 200 * 350

    x = ones( Nsample ) * pi / 6
    y = zeros( Nsample )

    for i = 1:500
        y[:] = evaluate( interp, x )
    end

    @show y[1:3]  # The result should be 0.5 (= sin(pi/6))
end

@time perf()
@time perf()
@time perf()

where the size of the problem is kept the same, while the input x & y coordinates were changed so that the result is trivially known (0.5). On my machine, the result is

y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
  1.886956 seconds (174.20 k allocations: 277.414 MB, 3.55% gc time)
y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
  1.659290 seconds (1.56 k allocations: 269.295 MB, 0.39% gc time)
y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
  1.648145 seconds (1.56 k allocations: 269.295 MB, 0.28% gc time)

From now on, I will omit y[1:3] for brevity (I have confirmed that in all cases the obtained y[1:3] is correct). If we replace evaluate() by copy!(y,x), the result becomes

  0.206723 seconds (168.26 k allocations: 10.137 MB, 10.27% gc time)
  0.023068 seconds (60 allocations: 2.198 MB)
  0.023013 seconds (60 allocations: 2.198 MB)
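(For clarity, the copy!() variant replaces only the measurement loop, with the same shapes as in perf(); this stand-in times just the loop and array-copy overhead:)

Nsample = 200 * 350
x = ones( Nsample ) * pi / 6
y = zeros( Nsample )

for i = 1:500
    copy!( y, x )   # stand-in for evaluate( interp, x )
end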

So essentially all the time is spent in evaluate(). Looking at the original code of this routine, we see that it calls splev() in Fortran, which in turn calls fpbspl() (both originally from Netlib). Those routines are rather old (written around 1990) and seem not very well optimized for current computers (for example, there are lots of IF branches, and vectorization may be difficult...). Because it is not trivial to "vectorize" the code, I instead tried parallelization with OpenMP. The modified splev() looks like this, where the input points are simply divided among the threads:

      subroutine splev(t,n,c,k,x,y,m,e,ier)
c  subroutine splev evaluates in a number of points x(i),i=1,2,...,m
c  a spline s(x) of degree k, given in its b-spline representation.
(... same here ...)

c  main loop for the different points.
c$omp parallel do default(shared)
c$omp.firstprivate(arg,ier,l1,l,ll,sp,h) private(i,j)
      do i = 1, m

c  fetch a new x-value arg.
        arg = x(i)
c  check if arg is in the support
        if (arg .lt. tb .or. arg .gt. te) then
            if (e .eq. 0) then
                goto 35
            else if (e .eq. 1) then
                y(i) = 0
                goto 80
            else if (e .eq. 2) then
                ier = 1
                ! goto 100        !! I skipped this error handling for simplicity.
            else if (e .eq. 3) then
                if (arg .lt. tb) then
                    arg = tb
                else
                    arg = te
                endif
            endif
        endif

c  search for knot interval t(l) <= arg < t(l+1)
 35     if ( t(l) <= arg .or. l1 == k2 ) go to 40
        l1 = l
        l = l - 1
        go to 35
  40    if ( arg < t(l1) .or. l == nk1 ) go to 50
        l = l1
        l1 = l + 1
        go to 40

c  evaluate the non-zero b-splines at arg.
  50    call fpbspl(t, n, k, arg, l, h)

c  find the value of s(x) at x=arg.
        sp = 0.0d0
        ll = l - k1

        do 60 j = 1, k1
          ll = ll + 1
          sp = sp + c(ll)*h(j)
  60    continue
        y(i) = sp

 80     continue

      enddo
c$omp end parallel do
 100  return
      end

Now rebuilding the package with gfortran -fopenmp and calling perf() above gives

$ OMP_NUM_THREADS=1 julia interp.jl
  1.911112 seconds (174.20 k allocations: 277.414 MB, 3.49% gc time)
  1.599154 seconds (1.56 k allocations: 269.295 MB, 0.41% gc time)
  1.671308 seconds (1.56 k allocations: 269.295 MB, 0.28% gc time)

$ OMP_NUM_THREADS=2 julia interp.jl
  1.308713 seconds (174.20 k allocations: 277.414 MB, 5.14% gc time)
  0.874616 seconds (1.56 k allocations: 269.295 MB, 0.46% gc time)
  0.897505 seconds (1.56 k allocations: 269.295 MB, 0.21% gc time)

$ OMP_NUM_THREADS=4 julia interp.jl
  0.749203 seconds (174.20 k allocations: 277.414 MB, 9.31% gc time)
  0.446702 seconds (1.56 k allocations: 269.295 MB, 0.93% gc time)
  0.439522 seconds (1.56 k allocations: 269.295 MB, 0.43% gc time)

$ OMP_NUM_THREADS=8 julia interp.jl
  0.478504 seconds (174.20 k allocations: 277.414 MB, 14.66% gc time)
  0.243258 seconds (1.56 k allocations: 269.295 MB, 1.81% gc time)
  0.233157 seconds (1.56 k allocations: 269.295 MB, 0.89% gc time)

$ OMP_NUM_THREADS=16 julia interp.jl
  0.379243 seconds (174.20 k allocations: 277.414 MB, 19.02% gc time)
  0.129145 seconds (1.56 k allocations: 269.295 MB, 3.49% gc time)
  0.124497 seconds (1.56 k allocations: 269.295 MB, 1.80% gc time)

# Julia: v0.4.0, Machine: Linux x86_64 (2.6GHz, Xeon2.60GHz, 16 cores)

so the scaling seems quite good (but please let me know if I am making a big mistake in using OpenMP this way...). If the above result is correct, it means that 8 threads are necessary with this Fortran code to match the speed of interp1() on the OP's machine. The good news is that there is probably room for improvement in the Fortran code (even for a serial run). Anyway, the OP's program (in its final form) seems to be comparing the performance of the underlying interpolation routines, i.e., interp1() in Matlab vs splev() in Fortran.
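(If one wants a more direct per-call comparison with interp1(), one can also time just the evaluate() kernel after a warm-up call; a minimal sketch along the lines of perf() above, with the same problem size:)

using Dierckx

Nx = 300
xinp = Float64[ 2pi * i / Nx for i = 1:Nx ]
interp = Spline1D( xinp, sin( xinp ) )
x = ones( 200 * 350 ) * pi / 6

evaluate( interp, x )         # warm-up call, so compilation is excluded
t = @elapsed for i = 1:500
    evaluate( interp, x )
end
@show t / 500                 # mean time per evaluate() call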

roygvib
  • That's a good answer. So it wasn't actually Julia and I was comparing apples with oranges (multithreading vs. single)! Good to know. Thanks for the effort! – marky2k Jan 14 '16 at 17:28
  • Though I focused on the Fortran part, I really think Ismael's first EDIT part is helpful, so let's keep those points in mind... (I'm also reading the manual but the volume is pretty large!) – roygvib Jan 14 '16 at 17:46
  • @roygvib that is very interesting. Since you seem pretty fluent in fortran: perhaps, since the `Dierckx` routines are several decades old and written adhering to archaic idioms, `-fPIC -O3` may not be enough for `gfortran` to squeeze out every last bit of instruction-level parallelism? And I guess my timing observation was most likely a coincidence: the Matlab documentation would state if built-in functions were multithreaded - warning developers of potentially thread-unsafe code? In any case it is a bit disappointing to see that Julia's go-to (?) interpolation routines could be faster. – luffe Jan 14 '16 at 22:27
  • @luffe I'm not an expert in this, but have you tried any of these packages besides `Dierckx`? [GridInterpolations](https://git.io/vzcLA), [ApproXD](https://git.io/vzcLh), [Interpolations](https://git.io/vzcte), [Grid](https://github.com/timholy/Grid.jl), see also Julia issue [#2557](https://github.com/JuliaLang/julia/issues/2557). Julia is young, give it time! :D – HarmonicaMuse Jan 15 '16 at 02:02