15

The performance benchmarks for Julia I have seen so far, such as those at http://julialang.org/, compare Julia to pure Python or Python+NumPy. Unlike NumPy, SciPy uses the BLAS and LAPACK libraries, which give us an optimal multi-threaded SIMD implementation. If we assume that Julia and Python performance are the same when calling BLAS and LAPACK functions (under the hood), how does Julia performance compare to CPython when using Numba or NumbaPro for code that doesn't call BLAS or LAPACK functions?

One thing I notice is that Julia is using LLVM v3.3, while Numba uses llvmlite, which is built on LLVM v3.5. Does Julia's old LLVM prevent an optimum SIMD implementation on newer architectures, such as Intel Haswell (AVX2 instructions)?

I am interested in performance comparisons for both spaghetti code and small DSP loops that handle very large vectors. For me, the latter is handled more efficiently by the CPU than by the GPU because of the overhead of moving data in and out of GPU device memory. I am only interested in performance on a single Intel Core i7 CPU, so cluster performance is not important to me. Of particular interest to me is the ease and success of creating parallelized implementations of DSP functions.

A second part of this question is a comparison of Numba to NumbaPro (ignoring the MKL BLAS). Is NumbaPro's target="parallel" really needed, given the new nogil argument for the @jit decorator in Numba?
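
For concreteness, here is a minimal sketch of the kind of kernel and the two decorations I am comparing (the kernel, array size and ufunc signature are illustrative placeholders only; with NumbaPro the vectorize decorator would be imported from numbapro instead):

import numpy as np
import numba

# A toy stand-in for a small DSP loop over a very large vector.
@numba.jit(nopython=True, nogil=True)
def gain_square(x, gain, out):
    for i in range(x.shape[0]):
        out[i] = (gain * x[i]) ** 2

# The NumbaPro-style alternative: a parallel ufunc.
@numba.vectorize(['float64(float64, float64)'], target='parallel')
def gain_square_ufunc(x, gain):
    return (gain * x) ** 2

x   = np.random.rand(int(1e7))
out = np.empty_like(x)
gain_square(x, 0.5, out)        # @jit with nogil=True (single thread on its own)
y = gain_square_ufunc(x, 0.5)   # target='parallel' fans out over CPU cores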

hiccup
  • @user3666197 Flaming responders and espousing conspiracy theories about SO responders engenders little sympathy for your cause. Your answer is verbose and difficult to understand. Your subsequent comments insult the goodwill of Julia users on SO who volunteer their time to answer questions. If you have constructive criticism about Julia performance timings versus Python/Numba, then consider posting a separate question on SO or a Julia user list. This question by hiccup is not the appropriate avenue. – Kevin L. Keys Jan 06 '17 at 02:19
  • Dear Kevin L. Keys, thanks for responding to a deleted comment. **Fact #1:** the practice of deleting a post is called censorship, irrespective of the motivation for exercising that kind of power. **Fact #2:** citing the unfair timing practice documented in the LuaJIT discussion is a citation, not an opinion, let alone an insult. **Fact #3:** a constructive proposal has been present since the first version of the Answer, in the form of a **reproducible MCVE**, to allow running a **coherent** experiment, whereas later comments have brought only incoherent test factors (plus new light from a documented, principal Lua incident). – user3666197 Jan 06 '17 at 05:59
  • The beauty and power of scientific critical thinking lies in its ability to repeat tests to confirm or invalidate a theory, model or test. If hiccup has asked about numba-LLVM/JIT-compiled performance and the published statement says that GIL-stepped, interpreted code runs 22x slower, the experiment proposed below tested the expected speed range for a coherent experiment (which ought to be run and updated by the language maintainers, with a corrected, fair timing method). **Having sent a research proposal in this direction to prof. Sanders** (now at the MIT Julia Lab), **it is fully doable.** – user3666197 Jan 06 '17 at 06:18
  • Last but not least, given that your argument strives to protect *(cit.:) "... the goodwill of Julia users on SO who volunteer their time to answer questions"*, let me ask you to **kindly pay the very same respect** to my volunteered time to answer **@hiccup**'s question and to my goodwill in communicating the core merit, while being exposed to repeated censorship and destructive down-voting hysteria. If one considers the Answer below difficult to understand and/or verbose, it strove to cite facts in a repeatable MCVE experiment, to allow those who can and want to re-run it to get results. – user3666197 Jan 06 '17 at 06:29
  • Given that several previous comments on the influence of the caching hierarchy on tests were deleted, and in the hope that the censors will not delete a link to Jean-François Puget's (IBM France) similarly motivated, thorough experimentation re-running Sebastian F. Walter's tests, but on realistically sized matrices (where different caching strategies do show their edge): **https://www.ibm.com/developerworks/community/blogs/jfp/entry/A_Comparison_Of_C_Julia_Python_Numba_Cython_Scipy_and_BLAS_on_LU_Factorization?lang=en**, where SciPy+LAPACK show their remarkable edge on matrix sizes above 1000x1000. – user3666197 Jan 07 '17 at 05:22

3 Answers

9

This is a very broad question. Regarding the benchmark requests, you may be best off running a few small benchmarks yourself that match your own needs (a minimal timing sketch is included at the end of this answer). To answer one of the questions:

One thing I notice is that Julia is using LLVM v3.3, while Numba uses llvmlite, which is built on LLVM v3.5. Does Julia's old LLVM prevent an optimum SIMD implementation on newer architectures, such as Intel Haswell (AVX2 instructions)?

[2017/01+: The information below no longer applies to current Julia releases]

Julia does turn off AVX2 with LLVM 3.3 because there were some deep bugs on Haswell.

Julia is built with LLVM 3.3 for the current releases and nightlies, but you can build with 3.5, 3.6, and usually svn trunk (if we haven't yet updated for some API change on a given day, please file an issue). To do so, set LLVM_VER=svn (for example) in Make.user and then proceed to follow the build instructions.
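
If you do roll your own micro-benchmarks on the Python/Numba side, a minimal timing sketch along these lines (the pi-sum-style kernel is only a placeholder for your own code) keeps the measurement method identical for the interpreted and the JIT-compiled variants:

import timeit
import numba

def pisum():
    # placeholder kernel; substitute your own DSP loop here
    s = 0.0
    for j in range(500):
        s = 0.0
        for k in range(1, 10001):
            s += 1.0 / (k * k)
    return s

pisum_jit = numba.jit(nopython=True)(pisum)
pisum_jit()   # warm-up call, so JIT compilation time is excluded

for name, fn in (("interpreted", pisum), ("numba.jit", pisum_jit)):
    best = min(timeit.repeat(fn, number=1, repeat=5))
    print("%-12s %8.2f ms" % (name, best * 1e3))

Timing only warmed-up calls keeps the one-off JIT compilation cost out of the comparison.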

Isaiah Norton
5

See here (section 4) for some peer-reviewed benchmarks which I personally worked on. The comparison was between Julia and PyPy.

mlubin
  • I excluded PyPy from consideration because it doesn't support SciPy, matplotlib, 64-bit Windows+Python & Python 3.3+. In 2013, when the referenced paper was written, PyPy also didn't support BLAS & LAPACK. For scientific applications, I prefer to compare to CPython+SciPy+LLVM (Numba or NumbaPro). – hiccup Apr 13 '15 at 18:37
-2

(Comparing the incomparable is always a double-edged sword.

What follows is presented in the honest belief that LLVM/JIT-powered code benchmarks ought to be compared with other LLVM/JIT-powered alternatives if any derived conclusion is to serve as a basis for reasonably supported decisions.)


Intro: (the numba details and the [us] results come a bit further down the page)

With all due respect, the official site presents a tabulated set of performance tests, in which two categories of facts are stated. The first concerns how the performance test was performed (Julia, using LLVM-compiled code execution, versus Python, remaining GIL-stepped, interpreted code execution). The second states how much longer the other languages take to complete the same "benchmark task", using C-compiled code execution as the relative unit of time = 1.0.

The section header above the table of results says (cit.:)

High-Performance JIT Compiler
Julia’s LLVM-based just-in-time (JIT) compiler combined with the language’s design allow it to approach and often match the performance of C.

[table of benchmark results from julialang.org]
I thought it more rigorous to compare apples to apples, so I took just one of the "benchmark tasks", the one called pi-sum.

This was the second-worst time for interpreted Python, reported to have run 21.99 times slower than the LLVM/JIT-compiled Julia code or the C-compiled alternative.

So a small experiment began.

@numba.jit( JulSUM, nogil = True ):

Let's start by comparing apples to apples. If the Julia code is reported to run 22x faster, let's first measure a plain, interpreted Python run:

>>> def JulSUM():
...     sum = 0.
...     j   = 0
...     while j < 500:
...           j   += 1
...           sum  = 0.
...           k    = 0
...           while k < 10000:
...                 k   += 1
...                 sum += 1. / ( k * k )
...     return sum
...
>>> from zmq import Stopwatch
>>> aClk = Stopwatch()
>>> aClk.start();_=JulSUM();aClk.stop()
1271963L
1270088L
1279277L
1277371L
1279390L
1274231L

So the core of the pi-sum runs in about 1,27x,xxx [us], i.e. roughly 1.27~1.28 [s].

Given the table row for pi-sum in the language presentation on the website, the LLVM/JIT-powered Julia code execution ought to run about 22x faster, i.e. in under ~57.92 [ms]:

>>> 1274231 / 22
57919

So let's convert oranges to apples, using numba.jit (v0.24.0):

>>> import numba
>>> JIT_JulSUM = numba.jit( JulSUM )
>>> aClk.start();_=JIT_JulSUM();aClk.stop()
1175206L
>>> aClk.start();_=JIT_JulSUM();aClk.stop()
35512L
37193L
37312L
35756L
34710L

So, once the JIT compiler has done its job, the numba-LLVM'ed Python exhibits benchmark times of about 34.7~37.3 [ms].

Can we go further?

Oh sure, we have not done much numba tweaking yet, and since the code example is so trivial, no surprising advances should be expected down the road.

First, let's remove the GIL-stepping, which is unnecessary here:

>>> JIT_NOGIL_JulSUM = numba.jit( JulSUM, nogil = True )
>>> aClk.start();_=JIT_NOGIL_JulSUM();aClk.stop()
85795L
>>> aClk.start();_=JIT_NOGIL_JulSUM();aClk.stop()
35526L
35509L
34720L
35906L
35506L

`nogil=True` does not push the execution much further, but it still shaves off a few more [ms], driving all results under ~35.9 [ms].

>>> JIT_NOGIL_NOPYTHON_JulSUM = numba.jit( JulSUM, nogil = True, nopython = True )
>>> aClk.start();_=JIT_NOGIL_NOPYTHON_JulSUM();aClk.stop()
84429L
>>> aClk.start();_=JIT_NOGIL_NOPYTHON_JulSUM();aClk.stop()
35779L
35753L
35515L
35758L
35585L
35859L

`nopython=True` adds just a final polishing touch, getting all results consistently under ~35.86 [ms] (vs. ~57.92 [ms] for the LLVM/JIT-compiled Julia).


Epilogue on DSP processing:

For the sake of the OP's question about additional benefits for accelerated DSP processing, one may try and test numba + Intel Python (via Anaconda), where Intel has opened a new horizon in binaries optimised for the internals of its own processors, so the code execution may enjoy additional CPU-bound tricks based on Intel's knowledge of the ILP, vectorisation and branch-prediction details its CPUs exhibit at runtime. It is worth a test to compare this (plus one may enjoy their non-destructive code-analysis tool integrated into Visual Studio, where in-vitro code-execution hot-spots can be analysed in real time, something a DSP engineer would just love, wouldn't he/she?).
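
As a concrete starting point for the DSP part of @hiccup's question, here is a minimal sketch (the kernel, vector size and thread count are illustrative only, not a tuned implementation) of how a nogil=True-compiled kernel can be driven from plain Python threads, so a single multi-core i7 is used without any GPU round-trips:

import math
import numpy as np
import numba
from concurrent.futures import ThreadPoolExecutor

@numba.jit(nopython=True, nogil=True)
def soft_clip(x, out, gain):
    # toy per-sample nonlinearity, a stand-in for a real DSP kernel
    for i in range(x.shape[0]):
        out[i] = math.tanh(gain * x[i])

def soft_clip_parallel(x, gain=2.0, n_threads=4):
    # split the long vector into independent chunks; nogil=True lets
    # the jitted kernel run concurrently on separate CPU cores
    chunks = np.array_split(x, n_threads)
    outs   = [np.empty_like(c) for c in chunks]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(soft_clip, c, o, gain)
                   for c, o in zip(chunks, outs)]
        for f in futures:
            f.result()
    return np.concatenate(outs)

y = soft_clip_parallel(np.random.rand(int(1e7)))

Timing this against the single-threaded call, with the same zmq.Stopwatch method as above, would show how much of NumbaPro's target="parallel" benefit the plain nogil path already delivers.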

user3666197
  • Did you actually run the Julia code on your own machine? Which exact code? What was the timing? I suggest multiplying the workload by a factor of at least a hundred to have a fairer comparison. – David P. Sanders Jan 02 '17 at 22:47
  • (Yes, the 500x-repeated 10k loop could be run many more times; however, I kept the cited [tag:julia-lang] site methodology 1:1.) – user3666197 Jan 03 '17 at 02:55
  • **Which exact code?** Well, as noted several times above, the **pi_sum** test. – user3666197 Jan 03 '17 at 03:24
  • **What was the timing?** Well, I re-used the ZeroMQ `zmq.Stopwatch()` instance, which exhibits better than 1 [us] resolution in time. For all the **[tag:julia-lang]**-published relative times of the respective benchmark measurements, kindly refer to www.julialang.org (as hyperlinked above, in the Answer). – user3666197 Jan 03 '17 at 03:28
  • It's not hard to imagine reasons that performance ratios might be different between different systems. Reporting any time (let alone a time with four significant digits) for Julia when you didn't actually run it on the same system is disingenuous. But it doesn't matter all that much because you're right in some regard — Numba *should* be roughly on par with Julia. It's just not the same as Python. – mbauman Jan 03 '17 at 17:18
  • Sorry, but the citations of figures and texts objected to as *(cit.:)* "disingenuous" were taken directly from the official **[tag:julia-lang]** site. The whole post is not about the Julia language, much less about its relative performance. The post is about comparing apples to apples, i.e. one LLVM/JIT'd code experiment to another LLVM/JIT'd code experiment, with some common ground (the given `pi_sum` algorithm, which, again, was not my idea). **A pity most of the responses are flames only** (hidden or not), **whereas the ultimately fair move would be to update the official site to compare against `numba`-LLVM/JIT'd runs**. – user3666197 Jan 03 '17 at 18:03
  • This is **exactly what hiccup asked to discuss here**, isn't it? *Btw, some of my comments on rigorous testing methodology, which would exclude the skew that arises in large test batches from IA64 cache-related, unrealistic "improvements", were even deleted from here. Additional levels of censorship in the merit-focused community?* I just wanted to test code published to run 22x slower and compare the resulting times. **The resulting ~36 [ms] against 58 [ms] is a sufficiently representative difference to revise the official website *(or to censor all opponents holding a different opinion)*.** – user3666197 Jan 03 '17 at 18:11
  • Comparing Julia to numba is both sensible and interesting. But in order to do so, the codes must both obviously be run on the same machine. – David P. Sanders Jan 03 '17 at 23:01
  • Yes, David, the ideal scenario would be to get the official language maintainers to extend the said experiment with numba code executed on the identical Xeon machine on which the other results for the Table were collected, but again, this is not what @hiccup has asked about. **The 55 [ms] and 1210.3283405.. [ms] published for the Xeon are as coherent a test as using the published 1/22 of the 1277 [ms] (measured on an i5), setting a fair threshold of 58 [ms] for the same (i5) machine to run the numba-LLVM/JIT-compiled part as a coherent experiment on the (i5). On the Xeon, the i5, or any other machine, a ~0.6x factor is fair to expect.** – user3666197 Jan 04 '17 at 04:19
  • This is taught as the art of indirectly setting up a coherent experiment in an introductory physics course. Sure, architectural differences will provide larger cache areas, taller or shorter L3/L2/L1 cache hierarchies, different NUMA, other VLIW, ILPx, CISC/RISC (to name a few) for both LLVM/JIT-compiled code executions, but both Code-Under-Test representations still run on the grounds of a coherent experiment that compares apples to apples (JIT/JIT), not apples to oranges (JIT/GIL). The principles of a coherent test were preserved and provided a sufficient answer to @hiccup's question of interest. – user3666197 Jan 04 '17 at 04:36
  • For what it's worth, Julia 0.5 is twice as fast as numba on my machine for this particular micro-benchmark. – David P. Sanders Jan 04 '17 at 12:38
  • @DavidP.Sanders Was that with the `numba.jit( JulSUM, nogil = True )` code? Was Julia 0.5 also used for all the other microbenchmarks published on the language's official site? What Python / numba / zmq versions were used on the non-Julia side of the test? The factor on another test bench (other version, other machine) is not that interesting per se, but the coherent-test methodology and the interpretation of the result are. **Would you mind providing documentation of both tests, with screen copies, as your answer to @hiccup's question of interest?** – user3666197 Jan 04 '17 at 13:43
  • Here's an [example](https://github.com/JuliaLang/julia/issues/14222) of an alternative approach, where perhaps GitHub is superior to Stack Overflow for extended discussions and analysis. – daycaster Jan 05 '17 at 10:35
  • Well, thanks, **daycaster**, for showing another dimension of the incoherent testing observed. My initial and primary interest was in performing just a fair test comparison of LLVM/JIT-to-LLVM/JIT code executions. On the Python/numba side I have always used user-domain clocking (using zmq) and did not think about the Julia-side clocking. Mea culpa, and thanks again for keeping your eyes open and posting the link. – user3666197 Jan 05 '17 at 17:15