
All the code was run on the same machine, under Linux.

In Python:

import numpy as np
drr = abs(np.random.randn(100000,50))
%timeit np.log2(drr)

10 loops, best of 3: 77.9 ms per loop

In C++ (compiled with g++ -o log ./log.cpp -std=c++11 -O3):

#include <iostream>
#include <cmath>
#include <random>
#include <ctime>

int main()
{
  std::mt19937 e2(0);
  std::normal_distribution<> dist(0, 1);
  const int n_seq = 100000;
  const int l_seq = 50;
  static double x[n_seq][l_seq];

  for (int n = 0; n < n_seq; ++n) {
    for (int k = 0; k < l_seq; ++k) {
      x[n][k] = std::fabs(dist(e2));
      if (x[n][k] <= 0)
        x[n][k] = 0.1;
    }
  }

  clock_t begin = clock();
  for (int n = 0; n < n_seq; ++n) {
    for (int k = 0; k < l_seq; ++k) {
      x[n][k] = std::log2(x[n][k]);
    }
  }
  clock_t end = clock();

  std::cout << 1000.0 * (end - begin) / CLOCKS_PER_SEC << " ms\n";
  return 0;
}

Runs in 60 ms

In MATLAB:

abr = abs(randn(100000,50));
tic;abr=log2(abr);toc

Elapsed time is 7.8 ms.

I can understand the speed difference between C++ and numpy, but MATLAB beats everything. I've come across http://fastapprox.googlecode.com/svn/trunk/fastapprox/src/fastonebigheader.h, but it only handles float, not double, and I'm not sure how to convert it to double.

I also tried this: http://hackage.haskell.org/package/approximate-0.2.2.1/src/cbits/fast.c, which has fast log functions; when compiled as a numpy ufunc it runs in 20 ms, which is great, but the loss of accuracy is significant.

Any ideas on how to achieve the magical log2 speed that MATLAB gets?

UPDATE

Thank you all for the comments, that was very fast and very helpful! Indeed, the answer is parallelisation, i.e. spreading the load over several threads. Following @morningsun's suggestion,

%timeit numexpr.evaluate('log(drr)')

gives 5.6 ms, which is on par with MATLAB. Thank you! (numexpr is MKL-enabled.)
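
For completeness, a minimal sketch of the numexpr route (the log(a)/log(2.0) workaround for the missing log2 comes from the comments below; the VML check and thread count are extras that may need adjusting for other builds):

import numpy as np
import numexpr as ne

drr = np.abs(np.random.randn(100000, 50))

# numexpr has no log2, so divide natural logs instead (see the comments below)
ln2 = np.log(2.0)
log2_drr = ne.evaluate('log(drr) / ln2')
assert np.allclose(log2_drr, np.log2(drr))

# optional: confirm the build is VML/MKL-enabled and set the thread count
print(ne.get_vml_version())                      # None if built without VML
ne.set_num_threads(ne.detect_number_of_cores())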

  • Vectorisation and parallelisation. – IKavanagh Nov 19 '15 at 09:37
  • This is a typical [SIMD](https://en.wikipedia.org/wiki/SIMD) scenario. Explore vectorization techniques on the C++ code first. For example, try [OpenMP](http://openmp.org/wp/). – Fei Gao Nov 19 '15 at 09:41
  • The C++ compiler can't unroll the log2() calls, so it spends a lot of time keeping track of the loop indexes. And as IKavanagh says, MATLAB parallelizes the computation. You could easily do it with [OpenMP](http://openmp.org/wp/). – YSC Nov 19 '15 at 09:41
  • Check this question: http://stackoverflow.com/questions/6058139/why-is-matlab-so-fast-in-matrix-multiplication. I guess the MAGMA and BLAS packages of MATLAB are the magic words. That question deals with the loops in C++ and Java, whilst MATLAB vectorises everything under the hood. – Adriaan Nov 19 '15 at 09:44
  • The Python module [numexpr](https://github.com/pydata/numexpr) would also be interesting to benchmark. If you can get MKL (VML) enabled it'll do SIMD and threading on the fly. It doesn't do `log2` directly, so use `log(a)/log(2.0)`. –  Nov 19 '15 at 10:12
  • How about using CUDA / OpenCL? – Jules G.M. Nov 19 '15 at 23:37
  • One thing: if you have to deal with matrices which are huuuge and won't fit into RAM in dense format, **numexpr** may not be a solution to your problems. **numexpr** is not compatible with the scipy sparse matrix format (I had an issue with that: http://stackoverflow.com/questions/33824617/numexpr-doesnt-recognize-float-type-sparse-matrix). – El Brutale Nov 20 '15 at 15:23
  • @ElBrutale Since taking the log is an elementwise operation you only need the values stored in the sparse array, which you can easily access via its `.data` attribute (see the short sketch after these comments). – ali_m Nov 20 '15 at 19:47
  • @ali_m I agree, good point. However, for similar, common arithmetic operations which are not element-wise, this library just won't work. Posted it in case people need similar, but non-element-wise operations. – El Brutale Nov 23 '15 at 15:07
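
Side note on sparse matrices, following @ali_m's comment above: since the log is elementwise, it is enough to transform the stored values of the sparse array. A rough sketch (the shape, density and epsilon here are arbitrary, not from the thread):

import numpy as np
import scipy.sparse as sp

# a sparse matrix whose stored values are strictly positive
m = sp.rand(100000, 50, density=0.01, format='csr')
m.data = np.abs(m.data) + 1e-12

# transform only the explicitly stored entries; implicit zeros stay untouched
# (this is not log2 of the dense matrix, where the zeros would map to -inf)
m.data = np.log2(m.data)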

1 Answer


Note that ALL of the results below are for float32, not double precision.

UPDATE: I've ditched gcc completely in favour of Intel's icc. It makes ALL the difference when performance is critical and you don't have time to fine-tune your "compiler hints" to enforce gcc vectorization (see, e.g., here).

log_omp.c:

GCC: gcc -o log_omp.so -fopenmp log_omp.c -lm -O3 -fPIC -shared -std=c99

ICC: icc -o log_omp.so -openmp log_omp.c -lm -O3 -fPIC -shared -std=c99 -vec-report1 -xAVX -I/opt/intel/composer/mkl/include

#include <math.h>
#include "omp.h"
#include "mkl_vml.h"

#define restrict __restrict

// OpenMP-parallel elementwise log (note: natural log, not log2)
inline void log_omp(int m, float * restrict a, float * restrict c);

void log_omp(int m, float * restrict a, float * restrict c)
{
   int i;
#pragma omp parallel for default(none) shared(m,a,c) private(i)
   for (i=0; i<m; i++) {
      a[i] = log(c[i]);   /* libm double log; logf() would keep everything in single precision */
   }
}

// VML / icc only: split the array into chunks and let each OpenMP thread call
// MKL's vectorised vsLog10 (base-10 log; vsLn / vmsLn give the natural log)
void log_VML(int m, float * restrict a, float * restrict c)
{
   int i;
   int split_to = 14;
   int iter = m / split_to;
   int additional = m % split_to;

//   vsLn(m, c, a);
#pragma omp parallel for default(none) shared(m,a,c, additional, iter) private(i) num_threads(split_to)
   for (i=0;i < (m-additional); i+=iter)
     vsLog10(iter,c+i,a+i);
     //vmsLn(iter,c+i,a+i, VML_HA);

   // handle the leftover elements that did not fit into an even split
   if (additional > 0)
     vsLog10(additional, c+m-additional, a+m-additional);
     //vmsLn(additional, c+m-additional, a+m-additional, VML_HA);
}

In Python:

import numpy as np
from ctypes import CDLL, c_int, c_void_p

# load the shared library and declare the signature once, not on every call
lib = CDLL('./log_omp.so')
lib.log_omp.argtypes = [c_int,
                        np.ctypeslib.ndpointer(dtype=np.float32),
                        np.ctypeslib.ndpointer(dtype=np.float32)]
lib.log_omp.restype = c_void_p

def log_omp(xs, out=None):
    n = xs.shape[0]
    if out is None:
        out = np.empty(n, np.float32)
    lib.log_omp(n, out, xs)
    return out
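
A quick sanity check of the wrapper (assuming log_omp.so was built with one of the command lines above; note that the C code computes the natural log, not log2):

x = np.abs(np.random.randn(1000000).astype(np.float32)) + 1e-6   # keep values > 0
y = log_omp(x)
assert np.allclose(y, np.log(x), rtol=1e-5, atol=1e-6)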

Cython code (in an IPython notebook, hence the %% magic):

%%cython --compile-args=-fopenmp --link-args=-fopenmp
import  numpy as np
cimport numpy as np
from libc.math cimport log

from cython.parallel cimport prange
import cython

@cython.boundscheck(False)
def cylog(np.ndarray[np.float32_t, ndim=1] a not None,
        np.ndarray[np.float32_t, ndim=1] out=None):
    if out is None:
        out = np.empty((a.shape[0]), dtype=a.dtype)
    cdef Py_ssize_t i
    with nogil:
        for i in prange(a.shape[0]):
            out[i] = log(a[i])
    return out

Timings:

numexpr.detect_number_of_cores() // 2
28

%env OMP_NUM_THREADS=28
x = np.abs(np.random.randn(50000000).astype('float32'))
y = x.copy()

# GCC
%timeit log_omp(x, y)
10 loops, best of 3: 21.6 ms per loop
# ICC
%timeit log_omp(x, y)
100 loops, best of 3: 9.6 ms per loop
%timeit log_VML(x, y)
100 loops, best of 3: 10 ms per loop

%timeit cylog(x, out=y)
10 loops, best of 3: 21.7 ms per loop

numexpr.set_num_threads(28)
%timeit out = numexpr.evaluate('log(x)')
100 loops, best of 3: 13 ms per loop

So numexpr seems to be doing a better job than (poorly) compiled gcc code, but icc wins.

Some resources I found useful and shamelessly borrowed code from:

http://people.duke.edu/~ccc14/sta-663/Optimization_Bakeoff.html

https://gist.github.com/zed/2051661
