17

I have a large two-dimensional array arr which I would like to bin over the second axis using numpy. Because np.histogram flattens the array, I'm currently using a for loop:

import numpy as np

arr = np.random.randn(100, 100)

nbins = 10
binned = np.empty((arr.shape[0], nbins))

for i in range(arr.shape[0]):
    binned[i,:] = np.histogram(arr[i,:], bins=nbins)[0]

I feel like there should be a more direct and more efficient way to do that within numpy but I failed to find one.

obachtos

5 Answers

17

You could use np.apply_along_axis:

>>> import numpy as np
>>> x = np.array([range(20), range(1, 21), range(2, 22)])
>>> nbins = 2
>>> np.apply_along_axis(lambda a: np.histogram(a, bins=nbins)[0], 1, x)
array([[10, 10],
       [10, 10],
       [10, 10]])

The main advantage (if any) is that it's slightly shorter, but I wouldn't expect much of a performance gain. It's possibly marginally more efficient in the assembly of the per-row results.
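One thing worth keeping in mind with either the loop or apply_along_axis: when bins is an integer, np.histogram derives the bin edges from each row's own minimum and maximum, so the counts in different rows refer to different intervals. Below is a minimal sketch, assuming you would rather have the same edges for every row. The random array and the names per_row_edges and shared_edges are illustrative only, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1000))
nbins = 5

# Integer bins: the edges are computed per row from that row's min and max.
per_row_edges = np.array([np.histogram(row, bins=nbins)[1] for row in x])

# Shared edges: compute them once from the whole array, then reuse them.
shared_edges = np.linspace(x.min(), x.max(), nbins + 1)
counts = np.apply_along_axis(
    lambda a: np.histogram(a, bins=shared_edges)[0], 1, x
)

print(per_row_edges.shape)  # one set of nbins + 1 edges per row
print(counts.sum(axis=1))   # each row still accounts for all of its samples
```

Whether per-row or shared edges are appropriate depends on how the binned rows will be compared afterwards.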

Ami Tavory
  • 3
    Just to point out that the reason for using the `lambda a:` and `[0]` instead of just `np.apply_along_axis(np.histogram, axis=1, arr=x, bins=bins)` is because `np.histogram` returns two outputs, `hist` and `bin_edges`, and here we only want `hist`. – ThomasNicholas Aug 04 '18 at 22:19
2

I was a bit confused by the lambda in Ami's solution so I expanded it out to show what it's doing:

def hist_1d(a):
    # np.histogram returns (hist, bin_edges); keep only the counts
    return np.histogram(a, bins=nbins)[0]

counts = np.apply_along_axis(hist_1d, axis=1, arr=x)
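
A related sketch: np.apply_along_axis forwards any extra keyword arguments to the 1-D function, so the bin count can be passed explicitly instead of being read from an enclosing scope. The bins parameter on hist_1d and the example array are my own choices for illustration.

```python
import numpy as np

def hist_1d(a, bins):
    # np.histogram returns (hist, bin_edges); keep only the counts.
    return np.histogram(a, bins=bins)[0]

x = np.array([range(20), range(1, 21), range(2, 22)])

# Keyword arguments given after `arr` are handed on to hist_1d.
counts = np.apply_along_axis(hist_1d, 1, x, bins=2)
print(counts)
```

This should give the same three rows of [10, 10] as the lambda version.
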
ThomasNicholas
2

For a large number of small data series, I think you can go a lot faster using something like numpy.digitize. Here is an example with 5000 data series, each with a modest 50 data points, binned into just 10 bins. The speedup in this case is about an order of magnitude compared to the np.apply_along_axis implementation. The implementation looks like:

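def histograms(data, bin_edges):
    nbins = len(bin_edges) - 1
    # np.digitize returns i when bin_edges[i-1] <= x < bin_edges[i]
    indices = np.digitize(data, bin_edges)
    # np.histogram counts x == bin_edges[-1] in the last bin; do the same here
    indices[data == bin_edges[-1]] = nbins
    counts = np.zeros((data.shape[0], nbins))
    # Loop over bin numbers rather than np.unique(indices), so that empty bins
    # and out-of-range values cannot shift the columns
    for i in range(nbins):
        counts[:, i] = np.sum(indices == i + 1, axis=1)
    return counts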

And here are some timings and verification:

import time

data = np.random.rand(5000, 50)
bin_edges = np.linspace(0, 1, 11)

# Average each timing over 10 runs
t1 = time.perf_counter()
for _ in range(10):
    h1 = histograms(data, bin_edges)
t2 = time.perf_counter()
print('digitize ', 1000*(t2-t1)/10., 'ms')

t1 = time.perf_counter()
for _ in range(10):
    h2 = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_edges)[0], 1, data)
t2 = time.perf_counter()
print('numpy    ', 1000*(t2-t1)/10., 'ms')

assert np.allclose(h1, h2)

The result is something like this:

digitize  1.690 ms
numpy     15.08 ms
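
If the remaining Python loop over bins matters, one commonly used variation is to shift each row's bin indices into its own block and count everything with a single np.bincount call. The following is only a sketch under the assumption that all rows share the same bin_edges. The function name histograms_bincount is hypothetical, and I make no timing claim for it.

```python
import numpy as np

def histograms_bincount(data, bin_edges):
    """Row-wise histogram of a 2-D array using one np.bincount call."""
    n_rows = data.shape[0]
    n_bins = len(bin_edges) - 1
    # searchsorted(..., side='right') - 1 gives the 0-based bin of each value.
    idx = np.searchsorted(bin_edges, data, side='right') - 1
    # Like np.histogram, count values equal to the last edge in the last bin.
    idx[data == bin_edges[-1]] = n_bins - 1
    # Ignore anything that falls outside [bin_edges[0], bin_edges[-1]].
    valid = (idx >= 0) & (idx < n_bins)
    # Shift each row into its own block of n_bins slots, then count once.
    rows = np.broadcast_to(np.arange(n_rows)[:, None], data.shape)
    flat = rows[valid] * n_bins + idx[valid]
    return np.bincount(flat, minlength=n_rows * n_bins).reshape(n_rows, n_bins)

rng = np.random.default_rng(0)
data = rng.random((5000, 50))
bin_edges = np.linspace(0, 1, 11)

h = histograms_bincount(data, bin_edges)
ref = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_edges)[0], 1, data)
assert np.array_equal(h, ref)
```

The final assert compares the result against the per-row np.histogram reference, in the same spirit as the verification above.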

Cheers.

Axiomel
0

To bin (block-average) a numpy array along any axis you can use:

def bin_nd_data(arr, bin_n=2, axis=-1):
    """Average every bin_n consecutive elements of an nD array along one axis."""
    # Normalise negative axes so that axis=-1 needs no special case
    axis = axis % arr.ndim
    ss = list(arr.shape)
    if ss[axis] % bin_n != 0:
        print('bin nd data, not divisible bin given : array shape :', arr.shape, ' bin ', bin_n)
        return None
    # Split the binned axis into (number of bins, bin_n), then average over bin_n
    ss[axis] = ss[axis] // bin_n
    ss.insert(axis + 1, bin_n)
    # The default C order keeps consecutive elements together in each bin;
    # order='F' would group elements that lie ss[axis] apart instead
    return np.mean(np.reshape(arr, ss), axis=axis + 1)

Normalising the axis with a modulo removes the need to special-case 'axis=-1'.
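
As a small check of the reshape-and-average idea, here is a sketch using NumPy's default (C) order, in which each bin holds consecutive elements along the binned axis. The array and shapes are made up for illustration. Note that this averages neighbouring samples rather than counting values per interval as np.histogram does.

```python
import numpy as np

a = np.arange(12).reshape(2, 6)

# Split the last axis (length 6) into 3 bins of 2 consecutive elements each,
# then average within each bin.
binned = a.reshape(2, 3, 2).mean(axis=-1)
print(binned)
# rows: [0.5, 2.5, 4.5] and [6.5, 8.5, 10.5]
```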

Adrien Mau
-6

You have to use numpy.histogramdd, which is specifically meant for your problem.

Arpan Das
  • 3
    I don't quite get how. My understanding is that `histogramdd` is built for creating multidimensional histograms but I want to obtain several one dimensional histograms. – obachtos Oct 24 '16 at 09:56