I need to perform operations on very large arrays (several million entries), whose cumulated size is close to the available memory.
I understand that with a naive numpy expression such as a = a*3 + b - c**2, several temporary arrays are created, which take up additional memory.
As I'm planning to work at the limit of memory occupancy, I'm afraid this simple approach won't work, so I'd like to start my development with the right approach.
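For such a simple expression I could probably fall back on in-place ufunc calls, along these lines (a minimal sketch, where a, b and c are hypothetical same-shaped float arrays just for illustration):

import numpy as np

# Hypothetical arrays, for illustration only
n = 5_000_000
a = np.random.rand(n)
b = np.random.rand(n)
c = np.random.rand(n)

# In-place equivalent of a = a*3 + b - c**2: the first two statements reuse
# the buffer of a, only c**2 still allocates one temporary array
a *= 3
a += b
a -= c**2

But I'm not sure this style scales well to more complex functions.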
I know that packages like numba or pythran can help improve performance when manipulating arrays, but it is not clear to me whether they automatically handle in-place operations and avoid temporary objects.
As a simple example, here is one function I'll have to apply to large arrays:
import numpy as np

def find_bins(a, indices):
    global offset, width, nstep
    # Scale a into bin coordinates, then clamp to [0, nstep]
    i = (a - offset) * nstep / width
    i = np.where(i < 0, 0, i)
    i = np.where(i >= nstep, nstep, i)
    indices[:] = i.astype(int)
So something that mixes arithmetic operations and calls to numpy functions.
How easy would it be to write such functions using numba or pythran (or something else)? What would be the pros and cons in each case?
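To make the question concrete, here is what I naively imagine a numba version could look like (an untested sketch with an explicit loop, and with the globals passed as arguments since I've read that numba freezes global values at compile time):

import numpy as np
from numba import njit

# Untested sketch: same logic as find_bins, written element by element so that
# no temporary arrays are needed at all
@njit
def find_bins_numba(a, indices, offset, width, nstep):
    for k in range(a.shape[0]):
        i = (a[k] - offset) * nstep / width
        if i < 0:
            i = 0
        elif i >= nstep:
            i = nstep
        indices[k] = int(i)

Would something like this actually avoid the temporaries, and is the explicit loop the right way to write it?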
Thanks for any hint!
PS: I know about numexpr, but I'm not sure it is convenient or well suited to functions more complex than a single arithmetic expression.
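For instance, I suppose find_bins could be split into two evaluate calls (again an untested sketch, assuming numexpr's where() function and out= argument behave the way I think they do):

import numpy as np
import numexpr as ne

def find_bins_numexpr(a, indices, offset, width, nstep):
    # First expression: scale into bin coordinates (still allocates one float array)
    i = ne.evaluate("(a - offset) * nstep / width")
    # Second expression: clamp to [0, nstep], writing back into the same buffer
    ne.evaluate("where(i < 0, 0, where(i >= nstep, nstep, i))", out=i)
    indices[:] = i.astype(int)

which already looks less readable to me than the plain numpy version.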