Grouping data by bin with np.digitize

Question

I am looking for a way to average the data in an array based on how far each element is from a certain pixel. To achieve this, I have made an array rr that contains the distances to the center. A second array, data, contains the counts found in the pixel at each of those distances.

Now I have split the entire dataset (which ranges from 0 to 1150) into 60 bins and digitized the data to get an array that tells me which bin each value belongs to.

import numpy as np

bins = np.linspace(0, 60*20, 60)
digitized = np.digitize(rr, bins)

Is there a smart way to apply digitized to the data so that all points with the same bin value get averaged?

Array rr has shape (380,), and data has the same shape. The end result should be an array of 60 elements, each holding the average of the values in data that digitized assigns to that bin.
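For reference, here is a minimal sketch of what np.digitize returns for the bin edges defined in the question. The sample distances are made up for illustration and are not the asker's data. With the default right=False, np.digitize returns the index i such that bins[i-1] <= x < bins[i], so values inside the edge range map to 1 through len(bins) - 1.

```python
import numpy as np

# Bin edges as written in the question: 60 points from 0 to 1200 inclusive.
bins = np.linspace(0, 60 * 20, 60)
spacing = bins[1] - bins[0]  # 1200 / 59, roughly 20.34 rather than 20

# Hypothetical distances within the stated 0-1150 range.
rr = np.array([0.0, 10.0, 25.0, 1150.0])

# Index i satisfies bins[i-1] <= x < bins[i] for each x in rr.
digitized = np.digitize(rr, bins)
print(spacing)
print(digitized)
```

Note that 60 edges delimit only 59 intervals. If bins of width exactly 20 were intended (an assumption about the asker's intent), np.linspace(0, 60 * 20, 61) would produce them.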

It would help if you mention the shapes of all arrays involved, including the final result. — fountainhead, Mar 14 '19 at 13:07
What does each element of `data` represent? You mentioned that it "contains the counts that can be found in the pixel at that distance". Could you please explain that? — fountainhead, Mar 14 '19 at 13:32
Also, what is the desired output? Is it an array of the same size as rr where elements belonging to the same bin will get the same average value? — Dammi, Mar 14 '19 at 13:34
Currently it contains the std of the counts, but I need it to be a list of values. — Coolcrab, Mar 14 '19 at 13:39
"Currently it contains the std of the counts". Was that in response to my question about `data`? I'm afraid it didn't help me. What does "the counts" mean here? If `data[3]` contains 20, what does it mean? And preferably, please mention it in an edit to your question. — fountainhead, Mar 14 '19 at 13:50
It just contains single values, so data[3] is a float (58 in this case). — Coolcrab, Mar 14 '19 at 13:52
OK, so these are floats that represent the "counts that can be found in the pixel at that distance". Still, the role of these values in the desired output is not clear, or maybe it's just me. — fountainhead, Mar 14 '19 at 13:55

Dammi · Answer 1 · 2019-03-14T14:29:07.160

Here is my attempt, although I assume you are looking for something considerably more elegant? :)

import numpy as np

rr = np.random.randint(0, 15, 1000)
rr_sorted = np.sort(rr)

# Bins
bins = [0, 5, 10, 15]

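def assume_sorted_digitized(rr_sorted, bins):
    # Input is sorted, so values sharing a bin index form one contiguous run
    dig = np.digitize(rr_sorted, bins)
    # Occupied bin indices and the position where each run starts
    bin_nr, index = np.unique(dig, return_index=True)
    # Position where each run ends (start of the next run, or end of the array)
    index_adjusted = np.append(index[1:], len(rr_sorted))
    # np.digitize can return 0 .. len(bins), so allocate len(bins) + 1 slots
    bin_average = np.zeros(len(bins) + 1, dtype=np.float32)
    last_idx = 0
    for idx, bin_i in zip(index_adjusted, bin_nr):
        bin_average[bin_i] = rr_sorted[last_idx:idx].mean()
        last_idx = idx
    return bin_average

def nonsorted_digitized(rr, bins):
    dig = np.digitize(rr, bins)
    # np.digitize can return 0 .. len(bins), so allocate len(bins) + 1 slots
    bin_average = np.zeros(len(bins) + 1, dtype=np.float32)
    # Average the values of each occupied bin via a boolean mask
    for idx in np.unique(dig):
        bin_average[idx] = rr[dig == idx].mean()
    return bin_average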

%timeit assume_sorted_digitized(rr_sorted, bins)
%timeit nonsorted_digitized(rr, bins)

Assuming the input is already sorted gives a slight performance boost (the first timing is the sorted version):

86.5 µs ± 5.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
113 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
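A loop-free alternative, sketched here under stated assumptions rather than taken from the answer: np.bincount accepts a weights argument, so one call can sum data per bin and a second call can count the points per bin. The arrays rr and data below are synthetic stand-ins with the shape given in the question (380 elements), and empty bins are reported as NaN, which is a choice made for this sketch.

```python
import numpy as np

# Synthetic stand-ins for the question's arrays (assumed, not the real data).
rng = np.random.default_rng(0)
rr = rng.uniform(0, 1150, size=380)    # distances to the center
data = rng.normal(50, 5, size=380)     # value measured at each distance

bins = np.linspace(0, 60 * 20, 60)
digitized = np.digitize(rr, bins)

# np.digitize can return any index in 0 .. len(bins), so reserve that many slots.
n_slots = len(bins) + 1
sums = np.bincount(digitized, weights=data, minlength=n_slots)
counts = np.bincount(digitized, minlength=n_slots)

# Divide only where a bin has points; empty bins keep their NaN placeholder.
means = np.full(n_slots, np.nan)
np.divide(sums, counts, out=means, where=counts > 0)
```

Here means[i] is the average of data over all points whose digitized index is i. Slot 0 and slot len(bins) collect values below the first edge and at or above the last edge, so they stay NaN when every distance lies inside the edge range.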
