
I'm concerned with the speed of the following function:

import numpy as np

def cch(tau):
    # count pairs whose negated difference is within 0.001 of tau
    return np.sum(np.abs(-1 * np.diff(cartprod) - tau) < 0.001)

Where "cartprod" is a NumPy array that looks like this:

cartprod = np.array([[0.0123, 0.0123], [0.0123, 0.0459], ...])

The array has about 25 million rows. Basically, I'm trying to find a significantly faster way to compute the difference for every pair in that array. Is there an algorithm or function that's faster than np.diff, or is np.diff the end-all-be-all? I'm also open to anything else.
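For reference, on a two-column array `-np.diff` along the rows is just the first column minus the second, so the same count can be written with plain column subtraction; a small sketch (array size and tau here are illustrative, not the real data):

```python
import numpy as np

rng = np.random.default_rng(0)
cartprod = rng.random((1000, 2))  # small stand-in for the 25M-row array
tau = 0.01

# original formulation via np.diff
count_diff = np.sum(np.abs(-1 * np.diff(cartprod) - tau) < 0.001)

# -np.diff on a two-column array equals column 0 minus column 1
count_cols = np.sum(np.abs(cartprod[:, 0] - cartprod[:, 1] - tau) < 0.001)

assert count_diff == count_cols
```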

EDIT: Thank you all for your solutions!

Matthew K
  • .. timing for `np.subtract(a[:,0], a[:,1])` seems to be the same as `np.diff`. – wwii Oct 25 '18 at 02:46
  • How is `cartprod` generated? There may be some structure that can be exploited for a speedup. – user2699 Oct 25 '18 at 16:05
  • `cartprod = np.transpose([np.tile(times1, len(times2)), np.repeat(times2, len(times1))])` @user2699 where times1 and times2 are each about 5000 elements in length – Matthew K Oct 26 '18 at 04:41
  • Did either of the posted solutions work for you? – Divakar Oct 28 '18 at 04:45
  • @Divakar yea I ended up using @alexdor 's solution because it ended up being faster. For the record, to execute `cch(0)`, your solution took about `0.08` seconds, alexdor's took about `0.03`-`0.04` seconds, and my original took `0.22`-`0.23` seconds. Unfortunately, this may not be enough :( – Matthew K Oct 28 '18 at 15:41
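Following up on @user2699's comment: because `cartprod` is a Cartesian product of `times1` and `times2`, the count can be computed without materializing the 25M pairs at all, e.g. by sorting one array and binary-searching the window `(t2 + tau - 0.001, t2 + tau + 0.001)` for each element of the other. A sketch on small illustrative inputs (names and sizes are mine, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
times1 = rng.random(1000)
times2 = rng.random(1000)
tau, eps = 0.01, 0.001

# brute force over the full Cartesian product (the shape of the original data)
cartprod = np.transpose([np.tile(times1, len(times2)),
                         np.repeat(times2, len(times1))])
brute = int(np.sum(np.abs(-1 * np.diff(cartprod) - tau) < eps))

# structured count: sort times1 once, then for each t2 binary-search the
# open window (t2 + tau - eps, t2 + tau + eps)
s = np.sort(times1)
lo = np.searchsorted(s, times2 + tau - eps, side='right')
hi = np.searchsorted(s, times2 + tau + eps, side='left')
fast = int(np.sum(hi - lo))

assert brute == fast
```

This replaces ~25M elementwise operations with two O(n log n) sorts/searches, at the cost of assuming the Cartesian-product structure shown in the comments.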

3 Answers


We can leverage multiple cores with the numexpr module for large data, and gain memory efficiency (and hence performance) with some help from array slicing -

import numpy as np
import numexpr as ne

def cch_numexpr(a, tau):
    d = {'a0': a[:, 0], 'a1': a[:, 1], 'tau': tau}
    return np.count_nonzero(ne.evaluate('abs(a0-a1-tau)<0.001', d))

Sample run and timings on 25M-sized data -

In [83]: cartprod = np.random.rand(25000000,2)

In [84]: cch(cartprod, tau=0.5) == cch_numexpr(cartprod, tau=0.5)
Out[84]: True

In [85]: %timeit cch(cartprod, tau=0.5)
10 loops, best of 3: 150 ms per loop

In [86]: %timeit cch_numexpr(cartprod, tau=0.5)
10 loops, best of 3: 25.5 ms per loop

Around 6x speedup.

This was with 8 threads. Thus, with more threads available for compute, it should improve further. There is a related post on how to control the multi-core functionality.
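The thread-control point above can be sketched with numexpr's runtime API (`set_num_threads` is part of numexpr's public interface; the array size here is illustrative):

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000, 2)
d = {'a0': a[:, 0], 'a1': a[:, 1]}

prev = ne.set_num_threads(4)   # returns the previous thread count
count = np.count_nonzero(ne.evaluate('abs(a0 - a1 - 0.5) < 0.001', d))
ne.set_num_threads(prev)       # restore the old setting
```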

Divakar

I think you're hitting a wall by repeatedly returning multiple np.arrays of length ~25 million, rather than np.diff itself being slow. I wrote an equivalent function that iterates over the array and tallies the results as it goes. The function needs to be jitted with numba to be fast. I hope that's acceptable.

import numpy as np
from numba import jit

arr = np.random.rand(25000000, 2)

def cch(tau, cartprod):
    return np.sum(np.abs(-1 * np.diff(cartprod) - tau) < 0.001)
%timeit cch(0.01, arr)

@jit(nopython=True)
def cch_jit(tau, cartprod):
    count = 0
    tau = -tau
    for i in range(cartprod.shape[0]):
        count += np.less(np.abs(tau - (cartprod[i, 1] - cartprod[i, 0])), 0.001)
    return count
%timeit cch_jit(0.01, arr)

produces

294 ms ± 2.82 ms 
42.7 ms ± 483 µs 

which is about 6 times faster.

alexdor
    1) You can get an improvement of 30-40% (depending on your processor) by adding an assert statement `assert cartprod.shape[1]==2` before the loop. With that information the compiler can SIMD-vectorize the loop. https://github.com/numba/llvmlite/issues/270 2) Call the jitted function at least once before the timeit call, otherwise you will measure a mixture of compilation time and runtime. (Your result will depend on how many times timeit calls the function.) – max9111 Oct 29 '18 at 10:02

Just out of curiosity I compared the solutions of @Divakar (numexpr) and @alexdor (numba.jit). The numba-jitted implementation seems to be about twice as fast as numexpr.evaluate. The results are shown for 100 runs each:

np.sum:          111.07543396949768
numexpr:         12.282189846038818
JIT:             6.2505223751068115
'np.sum' returns same result as 'numexpr'
'np.sum' returns same result as 'jit'
'numexpr' returns same result as 'jit'

Script to reproduce the results:

import numpy as np
import time
import numba
import numexpr

arr = np.random.rand(25000000, 2)
runs = 100

def cch(tau, cartprod):
    return np.sum(abs(-1*np.diff(cartprod)-tau)<0.001)

def cch_ne(tau, cartprod):
    d = {'a0':cartprod[:,0],'a1':cartprod[:,1], 'tau': tau}
    count = np.count_nonzero(numexpr.evaluate('abs(a0-a1-tau)<0.001',d))
    return count

@numba.jit(nopython=True)
def cch_jit(tau, cartprod):
    count = 0
    tau = -tau
    for i in range(cartprod.shape[0]):
        count += np.less(np.abs(tau - (cartprod[i, 1]- cartprod[i, 0])), 0.001)
    return count

start = time.time()
for x in range(runs):
    x1 = cch(0.01, arr)
print('np.sum:\t\t', time.time() - start)

start = time.time()
for x in range(runs):
    x2 = cch_ne(0.01, arr)
print('numexpr:\t', time.time() - start)

x3 = cch_jit(0.01, arr)  # warm-up call so compilation time is not measured
start = time.time()
for x in range(runs):
    x3 = cch_jit(0.01, arr)
print('JIT:\t\t', time.time() - start)

if x1 == x2: print('\'np.sum\' returns same result as \'numexpr\'')
if x1 == x3: print('\'np.sum\' returns same result as \'jit\'')
if x2 == x3: print('\'numexpr\' returns same result as \'jit\'')
user69453
  • You have two major problems with your timings. 1) time.time() is quite imprecise. Use timeit or call the function multiple times in a loop, so that the overall time measured is at least 0.5s to 1s. 2) You are measuring compilation and runtime on the jitted function. Call it once before time measurement to get the real runtime. – max9111 Oct 29 '18 at 09:52
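A timing harness along the lines suggested above, using timeit with an explicit warm-up, might look like this (array size reduced for illustration):

```python
import timeit
import numpy as np

arr = np.random.rand(1_000_000, 2)

def cch(tau, cartprod):
    return np.sum(np.abs(-1 * np.diff(cartprod) - tau) < 0.001)

cch(0.01, arr)  # warm-up (essential for jitted functions; harmless here)
# best of 3 repeats smooths out system noise; divide by `number` per-call
t = min(timeit.repeat(lambda: cch(0.01, arr), number=5, repeat=3)) / 5
print(f'{t * 1e3:.1f} ms per call')
```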