I have three 1D arrays:

idxs: the index data
weights: the weight of each element in idxs
bins: the bins used to calculate the minimum weight within each bin
Here's my current method: use idxs to determine which bin each element of weights falls into, then calculate the min/max of the binned weights:

- Get slices, which shows which bin each idxs element belongs to.
- Sort slices and weights at the same time.
- Calculate the minimum of weights in each bin (slice).
numpy method
import random
import numpy as np
# create example data
out_size = int(10)
bins = np.arange(3, out_size-3)
idxs = np.arange(0, out_size)
#random.shuffle(idxs)
# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]
weights = idxs
# get which bin idxs belong to
slices = np.digitize(idxs, bins)
# get index and weights in bins
valid = (bins.max() >= idxs) & (idxs >= bins.min())
valid_slices = slices[valid]
valid_weights = weights[valid]
# sort slice and weights
sort_index = valid_slices.argsort()
valid_slices_sort = valid_slices[sort_index]
valid_weights_sort = valid_weights[sort_index]
# get index of the first occurrence of each unique slice
unique_slices, unique_index = np.unique(valid_slices_sort, return_index=True)
# calculate the minimum
res_sub = np.minimum.reduceat(valid_weights_sort, unique_index)
# save results
res = np.full((out_size), np.nan)
res[unique_slices-1] = res_sub
print(res)
Results:
array([ 3., nan, 5., nan, nan, nan, nan, nan, nan, nan])
If I increase out_size to 1e7 and shuffle the data, the run time (from np.digitize to the end) is slow:
13.5 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
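For clarity, the larger test case is just the toy example scaled up and shuffled; roughly like the sketch below (the np.random.shuffle call and the copy of weights are assumptions rather than the exact benchmark code):

# sketch of the larger test case (assumed setup)
out_size = int(1e7)
bins = np.arange(3, out_size - 3)
idxs = np.arange(0, out_size)
np.random.shuffle(idxs)
weights = idxs.copy()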
Here's the time consumed by each part:
np.digitize: 10.8 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
valid: 171 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
argsort and slice: 2.02 s ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
unique: 9.9 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.minimum.reduceat: 5.11 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.digitize() costs most of the time: 10.8 s. The next is argsort: 2.02 s.
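For monotonically increasing bins, np.digitize is documented to be equivalent to np.searchsorted(bins, x, side='right'), and since the bins in the example are consecutive integers, the bin index could in principle be computed arithmetically. The snippet below is only a sketch under those assumptions (integer idxs, bins = np.arange(...)); the timings above do not include it:

# sketch: two equivalent ways to get the digitize result, assuming
# monotonically increasing bins (and, for the arithmetic version,
# integer idxs with consecutive-integer bins as in the example)
slices_ss = np.searchsorted(bins, idxs, side='right')      # same as np.digitize(idxs, bins)
slices_arith = np.clip(idxs - bins[0] + 1, 0, len(bins))   # skips the binary search entirely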
I also checked the time consumed when calculating the mean using np.histogram:
counts, _ = np.histogram(idxs, bins=out_size, range=(0, out_size))
sums, _ = np.histogram(idxs, bins=out_size, range=(0, out_size), weights=weights, density=False)
mean = sums / np.where(counts == 0, np.nan, counts)
33.2 s ± 3.47 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Its speed is similar to my method of calculating the minimum.
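Since idxs are non-negative integers in this example, the same counts and sums could also be computed with np.bincount; this is only a sketch under that assumption, not a timed alternative:

# sketch: same counts/sums as the two np.histogram calls above,
# assuming idxs are non-negative integers smaller than out_size
counts = np.bincount(idxs, minlength=out_size)
sums = np.bincount(idxs, weights=weights, minlength=out_size)
mean = sums / np.where(counts == 0, np.nan, counts)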
scipy method
from scipy.stats import binned_statistic
statistics, _, _ = binned_statistic(idxs, weights, statistic='min', bins=bins)
print(statistics)
The result is a little different, but the speed is much slower (6x) for the longer (1e7) shuffled data:
array([ 3., nan, 5.])
1min 20s ± 6.93 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Summary
I want to figure out a quicker method. If the method is also suitable for dask, that would be excellent!
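To illustrate what "suitable for dask" would look like, here is only a sketch of how the inputs might be wrapped (the chunk size is arbitrary, and the grouped-minimum step itself is exactly the part I don't know how to express):

import dask.array as da

# sketch: chunked inputs; da.digitize exists, but as far as I know
# there is no dask equivalent of np.minimum.reduceat
d_idxs = da.from_array(idxs, chunks=1_000_000)
d_weights = da.from_array(weights, chunks=1_000_000)
d_slices = da.digitize(d_idxs, bins)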