I'm researching methods for computing expensive vector operations in Java, e.g. dot products or multiplications of large matrices. There are a few good threads here on this topic, such as this and this.
It appears that there is no reliable way of having the JIT compile code to use CPU vector instructions (SSE2, AVX, MMX...). Moreover, high-performance linear algebra libraries (ND4J, jblas, ...) do in fact make JNI calls to BLAS/LAPACK libraries for the core routines. And I understand BLAS/LAPACK packages to be the de facto standard choices for native linear algebra computations.
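To make the native-call approach concrete, here is a minimal sketch of what calling a BLAS routine from Java can look like. It assumes the netlib-java wrapper (com.github.fommil.netlib.BLAS) is on the classpath; that wrapper dispatches to a native BLAS when one is installed and falls back to a Java implementation otherwise. This is just an illustration, not a claim that any particular library uses exactly this API.

```java
import com.github.fommil.netlib.BLAS;

public class NativeDotExample {
    public static void main(String[] args) {
        int n = 1_000_000;
        double[] x = new double[n];
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = i;
            y[i] = 2.0 * i;
        }

        // ddot is the standard BLAS level-1 dot product; the JNI boundary
        // is crossed once per call, regardless of the vector length.
        double result = BLAS.getInstance().ddot(n, x, 1, y, 1);
        System.out.println(result);
    }
}
```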
On the other hand, others (JAMA, ...) implement algorithms in pure Java without native calls.
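For contrast, a pure-Java dot product is just a scalar loop; whether the JIT turns it into SIMD code is exactly the uncertainty mentioned above, since auto-vectorization depends on the JVM version, flags, and target CPU. A minimal sketch:

```java
public final class PureJavaDot {

    // Plain scalar loop; the JIT may or may not emit vector instructions
    // for this, depending on the JVM and the CPU it runs on.
    public static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {4.0, 5.0, 6.0};
        System.out.println(dot(x, y)); // 32.0
    }
}
```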
My questions are:
- What are the best practices here?
- Is making native calls to BLAS/LAPACK actually a recommended choice? Are there other libraries worth considering?
- Is the overhead of JNI calls negligible compared to the performance gain? Does anyone have experience with where the threshold lies (e.g., how small does an input need to be before a JNI call becomes more expensive than a pure Java routine)? See the benchmark sketch after this list.
- How big is the portability tradeoff?
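Regarding the JNI-overhead question above, a JMH benchmark that sweeps the input size seems like the most reliable way to find the crossover point on a given machine. Here is a sketch; the nativeBlas method again assumes netlib-java is available, so swap in whichever library you are actually evaluating.

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import com.github.fommil.netlib.BLAS;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class DotProductBenchmark {

    // Sweep the vector length to locate the size at which the native call
    // starts to win despite the per-call JNI overhead.
    @Param({"16", "256", "4096", "65536"})
    int n;

    double[] x, y;

    @Setup
    public void setup() {
        Random rnd = new Random(42);
        x = new double[n];
        y = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = rnd.nextDouble();
            y[i] = rnd.nextDouble();
        }
    }

    @Benchmark
    public double pureJava() {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += x[i] * y[i];
        }
        return sum; // returning the result keeps the JIT from eliminating the loop
    }

    @Benchmark
    public double nativeBlas() {
        // Crosses the JNI boundary once per call when a native BLAS is installed.
        return BLAS.getInstance().ddot(n, x, 1, y, 1);
    }
}
```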
I hope this question is helpful both to those who develop their own computation routines and to those who just want to make an educated choice between existing implementations.
Insights are appreciated!