I'm currently working on a Python implementation of a coupled HMM, which involves element-wise multiplications, dot products, and sums over ndarrays of shape (50,50,40,40,40) and ndarrays of shape (40,40,40,40). That is to say, very large arrays... The second ndarray is fairly sparse, with about 70% zeros.
I've been using numpy.einsum until now, and it has given me quite slow results (the algorithm takes about 3 hours to run). The problem is that I now need to optimize some of the parameters of my HMM, which means I will have to make at least 10,000 runs, so I need a speed increase of a factor of at least 1,000 to keep things reasonable.
I've been looking around for the best way to do large array operations in Python, and by now I'm quite lost in everything I have read, so I have several questions. Before asking them, I just want to specify that I'm using the latest Anaconda distribution with Python 3, on a machine running OS X with an Intel processor and an NVIDIA GPU. I also believe that I can flatten my ndarrays to reduce my problem to a simple matrix-matrix product, as sketched below.
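For concreteness, here is a minimal sketch of the kind of contraction I mean, together with the flattened equivalent; the subscript string is made up for illustration, my actual indices differ:

    import numpy as np

    # Illustrative shapes from my problem; float32 to keep memory manageable (~640 MB for A)
    A = np.random.rand(50, 50, 40, 40, 40).astype(np.float32)
    B = np.random.rand(40, 40, 40, 40).astype(np.float32)  # ~70% zeros in my real data

    # Contract three shared axes (k, l, m), the way I currently use einsum
    C = np.einsum('ijklm,klmn->ijn', A, B, optimize=True)

    # The same contraction as a single matrix-matrix product on flattened arrays
    C2 = (A.reshape(50 * 50, 40 ** 3) @ B.reshape(40 ** 3, 40)).reshape(50, 50, 40)
    assert np.allclose(C, C2, rtol=1e-3)  # loose tolerance for float32 accumulation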
1) It seems that BLAS/MKL libraries can give pretty good speedups. It also seems that MKL is natively linked to NumPy when using Anaconda on OS X. So, have I been using MKL all along without knowing it? np.show_config() gives me this:
openblas_lapack_info:
    NOT AVAILABLE
blas_opt_info:
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['//anaconda/include']
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['//anaconda/lib']
lapack_opt_info:
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['//anaconda/include']
    libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['//anaconda/lib']
lapack_mkl_info:
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['//anaconda/include']
    libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['//anaconda/lib']
blas_mkl_info:
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['//anaconda/include']
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['//anaconda/lib']
mkl_info:
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['//anaconda/include']
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    library_dirs = ['//anaconda/lib']
Does that actually give a speed increase with the einsum function? If not, how can I get access to, for instance, the gemm function of MKL from Python?
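From what I've read, SciPy exposes the underlying BLAS routines directly, and they should route to MKL if SciPy is built against it (which I haven't verified for my install); a minimal sketch of what I mean:

    import numpy as np
    from scipy.linalg.blas import sgemm  # single precision; dgemm for double

    # Flattened shapes from my problem
    a = np.random.rand(2500, 64000).astype(np.float32)
    b = np.random.rand(64000, 40).astype(np.float32)

    # Computes alpha * (a @ b) using whatever BLAS SciPy is linked against
    c = sgemm(alpha=1.0, a=a, b=b)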
2) Would any of these Python libraries be useful for my problem: Theano, Numba, CuPy? Or simply Anaconda Accelerate? Would cuBLAS, via Anaconda Accelerate, be the best option for me (given that I'm willing to use single or even half precision floats)?
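For example, if CuPy really is a near drop-in replacement for NumPy on the GPU, I imagine the idea would be something like this (untested on my machine; it assumes a working CUDA setup and reuses the made-up subscripts from above):

    import numpy as np
    import cupy as cp  # assumes a CUDA toolkit and a compatible NVIDIA driver

    A = np.random.rand(50, 50, 40, 40, 40).astype(np.float32)
    B = np.random.rand(40, 40, 40, 40).astype(np.float32)

    A_gpu = cp.asarray(A)  # host -> device copy
    B_gpu = cp.asarray(B)
    C_gpu = cp.einsum('ijklm,klmn->ijn', A_gpu, B_gpu)  # same illustrative subscripts
    C = cp.asnumpy(C_gpu)  # device -> host copy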
3) Would it be useful to recode my algorithm in, for instance, C or C++?
4) I tried implementing my algorithm using sparse matrices (scipy.sparse.csc_matrix), but it made my program a lot slower. Was that to be expected, or did I make a mistake?
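For reference, a simplified version of what I tried, on the flattened problem:

    import numpy as np
    from scipy.sparse import csc_matrix

    A = np.random.rand(2500, 64000).astype(np.float32)
    B_dense = np.random.rand(64000, 40).astype(np.float32)
    B_dense[np.random.rand(64000, 40) < 0.7] = 0.0  # ~70% zeros, as in my data

    B_sparse = csc_matrix(B_dense)
    C = A @ B_sparse  # dense @ sparse (scipy >= 0.19); much slower than plain np.dot for me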
I know this is a lot of questions, but this is the first time I've been confronted with this kind of problem, and the resources I've found online aren't really clear about it...
Thanks a lot!