For the case of many small data series, you can do a lot better using something like np.digitize (like, a lot faster). Here is an example with 5000 data series, each with a modest 50 data points, binned into 10 discrete bins. The speedup in this case is roughly an order of magnitude compared to the np.apply_along_axis implementation:
import time
import numpy as np

def histograms(data, bin_edges):
    # digitize assigns each value a 1-based bin index:
    # bin i covers [bin_edges[i], bin_edges[i+1])
    indices = np.digitize(data, bin_edges)
    n_bins = len(bin_edges) - 1
    histograms = np.zeros((data.shape[0], n_bins))
    for i in range(n_bins):
        histograms[:, i] = np.sum(indices == i + 1, axis=1)
    return histograms
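To see why the index arithmetic works, here is a tiny illustration (not part of the benchmark above) of what np.digitize returns: 1-based bin indices, with 0 for values below the first edge and len(bin_edges) for values at or above the last edge.

```python
import numpy as np

bin_edges = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
x = np.array([0.1, 0.3, 0.6, 0.9])

# Each value lands in bin i where bin_edges[i-1] <= x < bin_edges[i]
print(np.digitize(x, bin_edges))  # [1 2 3 4]
```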
And here are some timings and verification:
data = np.random.rand(5000, 50)
bin_edges = np.linspace(0, 1, 11)
t1 = time.perf_counter()
for _ in range(10):
    h1 = histograms(data, bin_edges)
t2 = time.perf_counter()
print('digitize ', 1000*(t2-t1)/10., 'ms')

t1 = time.perf_counter()
for _ in range(10):
    h2 = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_edges)[0], 1, data)
t2 = time.perf_counter()
print('numpy    ', 1000*(t2-t1)/10., 'ms')

assert np.allclose(h1, h2)
The result is something like this:
digitize 1.690 ms
numpy 15.08 ms
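If you want to go further, the per-bin Python loop can be removed entirely. A sketch of a fully vectorized variant (my own addition, not benchmarked above) offsets each row's bin indices so a single np.bincount produces all row histograms at once; note it folds out-of-range values into the edge bins, which matches np.histogram only when the data lies within the bin range.

```python
import numpy as np

def histograms_bincount(data, bin_edges):
    n_rows = data.shape[0]
    n_bins = len(bin_edges) - 1
    # 0-based bin index per value; clip folds out-of-range
    # values into the first/last bin (fine for in-range data)
    indices = np.digitize(data, bin_edges) - 1
    np.clip(indices, 0, n_bins - 1, out=indices)
    # Shift row r's indices into the range [r*n_bins, (r+1)*n_bins)
    # so one flat bincount yields every row's histogram
    flat = indices + n_bins * np.arange(n_rows)[:, None]
    counts = np.bincount(flat.ravel(), minlength=n_rows * n_bins)
    return counts.reshape(n_rows, n_bins)
```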
Cheers.