Problem description
I have a 3D numpy array, `data`, of shape N x R x C, i.e. N samples, R rows and C columns. I would like to compute a histogram along the column axis for each combination of sample and row. However, the bin edges (see the `bins` argument of `numpy.histogram`), of fixed length S + 1 (i.e. S bins), differ from row to row but are shared across samples. For example, for the 1st sample (`data[0]`), the bin edge sequence for its 1st row differs from that for its 2nd row, but is the same as that for the 1st row of the 2nd sample (`data[1]`). All the bin edge sequences are therefore stored in a 2D numpy array of shape R x (S + 1), denoted `bin_edges`.
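To make the setup concrete, here is a tiny hand-made example (the numbers are purely illustrative and not from my real data) with N = 2 samples, R = 2 rows, C = 3 columns and 2 bins per row. It only shows which bin edges go with which slice of the array.

```python
import numpy as np

# Toy array of shape (N, R, C) = (2, 2, 3); values are made up for illustration
data = np.array([[[0.1, 0.5, 0.9], [1.0, 2.0, 3.0]],
                 [[0.2, 0.3, 0.8], [0.5, 1.5, 2.5]]])

# One edge sequence per row: 2 bins, hence 3 edges per row
bin_edges = np.array([[0.0, 0.5, 1.0],
                      [0.0, 2.0, 4.0]])

# Row 0 of both samples shares bin_edges[0]
h00 = np.histogram(data[0, 0], bins=bin_edges[0])[0]
h10 = np.histogram(data[1, 0], bins=bin_edges[0])[0]

# Row 1 of sample 0 uses a different edge sequence, bin_edges[1]
h01 = np.histogram(data[0, 1], bins=bin_edges[1])[0]

print(h00, h10, h01)  # [1 2] [2 1] [1 2]
```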
My question is how to efficiently calculate the histograms?
A working but slow solution
Using `numpy.histogram`, I was able to come up with a working but fairly slow solution, shown in the code snippets below:
```
# Get dummy data
# N: number of samples
# R: number of rows (or kernels)
# C: number of columns (or pixels)
# S: number of bins
import numpy as np
N, R, C, S = 100, 50, 1000, 10
data = np.random.randn(N, R, C)
# for each row/kernel, pool pixels of all samples
poolsamples = np.swapaxes(data, 0, 1).reshape(R, -1)
# use quantiles as bin edges
percentiles = np.linspace(0, 100, num=(S + 1))
bin_edges = np.transpose(np.percentile(poolsamples, percentiles, axis=1))
```
```
# A working but slow solution of getting histograms along column
hist = np.empty((N, R, S))
for idx in np.arange(R):
    bin_edges_i = bin_edges[idx, :]
    counts = np.apply_along_axis(
        lambda a: np.histogram(a, bins=bin_edges_i)[0],
        1, data[:, idx, :])
    hist[:, idx, :] = counts
```
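One variant I sketched keeps the loop over rows but replaces the N calls to `np.histogram` per row with a single `np.searchsorted` plus `np.bincount`. This is only a sketch: the function name `hist_rows_searchsorted` is my own, it assumes each row of `bin_edges` is sorted in increasing order, and I have not benchmarked it carefully.

```python
import numpy as np

def hist_rows_searchsorted(data, bin_edges):
    """Histograms per (sample, row); data is (N, R, C), bin_edges is (R, S + 1)."""
    N, R, C = data.shape
    S = bin_edges.shape[1] - 1
    hist = np.zeros((N, R, S), dtype=np.int64)
    # Shift each sample into its own block of S bins so one bincount handles all samples
    sample_offset = np.arange(N)[:, None] * S
    for r in range(R):
        edges = bin_edges[r]
        x = data[:, r, :]  # shape (N, C)
        # Bin k holds values with edges[k] <= x < edges[k + 1]
        idx = np.searchsorted(edges, x, side="right") - 1
        # np.histogram closes the last bin on the right, so x == edges[-1] belongs to bin S - 1
        idx[x == edges[-1]] = S - 1
        # Drop values that fall outside [edges[0], edges[-1]]
        valid = (idx >= 0) & (idx < S)
        flat = (idx + sample_offset)[valid]
        hist[:, r, :] = np.bincount(flat, minlength=N * S).reshape(N, S)
    return hist
```

With the dummy data above, `hist_rows_searchsorted(data, bin_edges)` should give the same counts as the loop over `np.histogram`, just with an integer dtype.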
Possible directions
- Some fancy numpy reshaping to avoid the for loop altogether
- This problem arises from extracting low-end characteristics for each image forwarded through a trained neural network. Therefore, if the histogram extraction could be embedded in a TensorFlow graph and ultimately be carried out on the GPU, that would be ideal!
- I noticed a Python package, fast-histogram, which claims to be 7-15x faster than `numpy.histogram`. However, its 1D histogram function only takes the number of bins rather than the actual bin positions.
- numexpr?
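For the first direction, a loop-free sketch I can think of uses broadcasting: count how many values lie below each edge, then take differences of those cumulative counts. The function name `hist_rows_broadcast` is my own, sorted edges per row are assumed, and the intermediate boolean array has N x R x C x S elements, so for large inputs it would probably have to be applied to chunks of samples.

```python
import numpy as np

def hist_rows_broadcast(data, bin_edges):
    """Loop-free histograms per (sample, row); data is (N, R, C), bin_edges is (R, S + 1)."""
    edges = bin_edges[None, :, None, :]  # shape (1, R, 1, S + 1)
    x = data[..., None]                  # shape (N, R, C, 1)
    # below[..., k] = number of values strictly below edge k, for the first S edges
    below = (x < edges[..., :-1]).sum(axis=2)        # shape (N, R, S)
    # The last edge is inclusive, matching np.histogram's closed last bin
    upto_last = (x <= edges[..., -1:]).sum(axis=2)   # shape (N, R, 1)
    cum = np.concatenate([below, upto_last], axis=-1)  # shape (N, R, S + 1)
    # Differences of cumulative counts give the per-bin counts
    return np.diff(cum, axis=-1)                     # shape (N, R, S)
```

Whether this actually beats the per-row loop is something I have not measured; it trades the Python loop for a much larger temporary array.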
I would love to hear any input! Thanks in advance!