Vectorized calculation of simple statistics for bins of subarrays, separately for fixed-width bins and fixed-frequency bins

Question

I have an array of subarrays as follows:

[
    [...]
    [...]
      ⋮
    [...]
]

All subarrays have the same length.
I need to bin each subarray and calculate the mean, standard deviation, median and other percentiles for each bin. I need separate results for binning by fixed width and by fixed frequency. The method should be vectorized, i.e. no for loops (or at least as few as possible, and none that are too costly), though of course each binning technique will need its own method. I don't know if this is even possible in a reasonably understandable manner (understandable for me, as I am quite the noob, but if it works I'll do my best). For fixed-width binning, you may assume for simplicity that the bins are defined by the data range of the first subarray.

How should I proceed?

Possibilities:
For fixed-frequency binning, the steps I had in mind were: somehow apply np.array_split to all subarrays at once by specifying the right axis argument, then use np.pad to fill the bins that are one element shorter with nan. Once the bins are no longer ragged sequences, we could hopefully apply np.nanmedian, again using whatever axis worked for np.array_split. However, I don't know whether a suitable axis can be specified for the splitting and median operations. Additionally, as far as I have seen, there is no way to avoid iterating through not just each of the rows but each of the bins in order to pad the shorter ones with the extra nan. Even if these iterations don't prove too costly and everything else works fine, I wouldn't know how to actually implement any step of this process. Nor do I know where to even begin for fixed-width binning.
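
The padding idea above can be sketched without looping over rows or bins. This is only a minimal illustration, assuming the bin sizes should follow the np.array_split convention (the first bins get the extra element) and that each row is already in the order to be binned; the function name equal_frequency_stats and the choice of the 90th percentile are hypothetical.

```python
import numpy as np

def equal_frequency_stats(data, n_bins):
    """Per-row statistics over position-based bins (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n_rows, n_cols = data.shape

    # Bin sizes as np.array_split would produce them, e.g. 7 items -> [3, 2, 2].
    base, extra = divmod(n_cols, n_bins)
    sizes = np.full(n_bins, base)
    sizes[:extra] += 1

    # For every column: which bin it falls in and its position inside that bin.
    bin_of_col = np.repeat(np.arange(n_bins), sizes)
    starts = np.cumsum(sizes) - sizes
    pos_in_bin = np.arange(n_cols) - starts[bin_of_col]

    # Rectangular (rows, bins, widest bin) array; shorter bins keep a NaN pad.
    padded = np.full((n_rows, n_bins, sizes.max()), np.nan)
    padded[:, bin_of_col, pos_in_bin] = data

    # NaN-aware reductions along the last axis ignore the padding.
    return {
        "mean": np.nanmean(padded, axis=2),
        "std": np.nanstd(padded, axis=2, ddof=1),  # sample std, as pandas uses
        "median": np.nanmedian(padded, axis=2),
        "p90": np.nanpercentile(padded, 90, axis=2),
    }

stats = equal_frequency_stats(
    [[0, 1, 2, 3, 4, 5, 6], [90, 45, 9, 88, 21, 59, 32]], n_bins=3
)
print(stats["median"])
```

The padding is written with a single fancy-indexed assignment, so no Python-level iteration over rows or bins is needed. Bins with a single element would give a NaN standard deviation with ddof=1.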

Here is a vectorized solution that accomplishes what I want, but only for the mean and only for a single array. I would certainly like to avoid iterating over each one of my subarrays, and I do not understand the method well enough to extend it to calculating the standard deviation, medians or any other percentiles.

If your suggested approach is through the pandas library e.g. using cut or qcut, is there a way this could be done without using for loops?

This is all very much related to my earlier question.
As I am new to this platform, I'm not sure what the best practices are. Ideally I would not like to delete that post, since it casts a wider net for solving my problem, whilst this post pursues a slightly more specific avenue described in it. I also would not want someone who has worked on an answer to that post to find it deleted. But if it is quite clear that I should delete the earlier post, do let me know.

EDIT: example with expected output, assume all objects are numpy arrays not lists
Example array:

[
    [0, 1, 2, 3, 4, 5, 6],
    [90, 45,  9, 88, 21, 59, 32],
    ⋮
]

Fixed-frequency binned example (3 bins, at most 3 objects per bin)

[
    [[0, 1, 2], [3, 4], [5, 6]],
    [[90, 45,  9], [88, 21], [59, 32]],
    ⋮
]

The above intermediate step need not be explicitly returned at any point but illustrates what will be occurring behind the scenes.

Medians of the fixed-frequency binned example

[
    [1, 3.5, 5.5],
    [45, 54.5, 45.5],
    ⋮
]

Edit 2: extended question, taking @hilberts_drinking_problem's answer as the accepted solution for the original problem
If x = [0, 1, 2, 3, 4, 5, 6] and y = [90, 45, 9, 88, 21, 59, 32], then you have already calculated everything I want (except the percentiles) for the data sorted by x. Suppose I also want the same statistics for the data sorted by y, with a multi-index such that df_2's row indices print as follows:

# x_srtd   x  
#          y  
# y_srtd   x  
#          y

How would I get this (including sorting x and y again by y) without for loops? (In case it matters, note that I plan on transposing the entire df_2 using .T at the end, for readability, so that 'x_srtd', 'y_srtd', 'x' and 'y' become column headers.)
Also, which of the methods in "Pass percentiles to pandas agg function" would you recommend?
Almost forgot: any ideas for how I would approach fixed-width binning, keeping in mind that x-sorted binning is going to be different from y-sorted binning? As examples, take bin_width_x = 1.5 for binning by x and similarly bin_width_y = 25.

Create a sample data and show us what is the expected output. https://stackoverflow.com/help/minimal-reproducible-example — , Jan 07 '22 at 08:16

score 0 · Answer 1 · answered Jan 07 '22 at 10:03

You could turn the columns of your DataFrame into a MultiIndex so that the zeroth level of the MultiIndex identifies the group of columns (i.e. the bin) you wish to aggregate. Here is an example:

import pandas as pd
import numpy as np

df = pd.DataFrame([
    [0, 1, 2, 3, 4, 5, 6],
    [90, 45,  9, 88, 21, 59, 32],
])

df.columns = pd.MultiIndex.from_tuples(
    [(i, c) for i, gp in enumerate(np.array_split(df.columns, 3)) for c in gp]
)
# print(df)
#     0          1       2    
#     0   1  2   3   4   5   6
# 0   0   1  2   3   4   5   6
# 1  90  45  9  88  21  59  32

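# note: groupby(axis=1) is deprecated in pandas >= 2.1;
# df.T.groupby(level=0).agg("mean").T gives the same result there
print(df.groupby(axis=1, level=0).agg("mean"))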
#       0     1     2
# 0   1.0   3.5   5.5
# 1  48.0  54.5  45.5

# the following raises not implemented error on Pandas version 1.1.5
# print(df.groupby(axis=1, level=0).agg(["mean", "std"]))

# as a workaround:
operations = ["mean", "std", "median"]
df2 = pd.concat((
    df.groupby(axis=1, level=0).agg(operation)
    for operation in operations
), axis=1)
df2.columns = pd.MultiIndex.from_product([
  operations, np.unique(df.columns.get_level_values(0))])
print(df2)
#    mean                    std                       median            
#       0     1     2          0          1          2      0     1     2
# 0   1.0   3.5   5.5   1.000000   0.707107   0.707107    1.0   3.5   5.5
# 1  48.0  54.5  45.5  40.583248  47.376154  19.091883   45.0  54.5  45.5
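
For the fixed-width case and the percentiles, one possible sketch follows. It is an illustration under assumptions rather than a tested recipe: bins are taken as half-open intervals of width 1.5 starting at the minimum of the first row (the bin_width_x example from the question), the frame is transposed so that x and y become columns, and the names bin_id, stats and p90 are made up for the example.

```python
import numpy as np
import pandas as pd

data = np.array([
    [0, 1, 2, 3, 4, 5, 6],
    [90, 45, 9, 88, 21, 59, 32],
])
bin_width = 1.5  # bin_width_x from the question

# Assign every column to a fixed-width bin defined on the first row:
# [0, 1.5) -> 0, [1.5, 3) -> 1, [3, 4.5) -> 2, ...
ref = data[0]
bin_id = np.floor((ref - ref.min()) / bin_width).astype(int)

# Work on the transposed frame so the groups run along the rows.
df = pd.DataFrame(data.T, columns=["x", "y"])
grouped = df.groupby(bin_id)

# Several statistics at once; single-element bins give NaN for std.
stats = grouped.agg(["mean", "std", "median"])

# Percentiles via the dedicated groupby method (here the 90th).
p90 = grouped.quantile(0.9)

print(stats)
print(p90)
```

Grouping by an array of bin labels means the bins may hold different numbers of points, which is what fixed-width binning needs, and no padding is required. Passing a list of fractions to quantile would return several percentiles in one call.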

Wow, your concise pandas manipulation is impressive, tho unfortunately I can't upvote you because of rep. Since you know your stuff I would like to request additional advice on how to manage my data. I have posted an edit 2 to the original question, would love to hear your thoughts. — blinking_elk, Jan 08 '22 at 11:30
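
For the y-sorted part of edit 2, a rough sketch of one way to build df_2 without explicit loops is shown below. It assumes the 'x_srtd' / 'y_srtd' labels from the question, two variables only, and fixed-frequency bins sized as np.array_split would size them; the median stands in for any of the other statistics.

```python
import numpy as np
import pandas as pd

x = np.array([0, 1, 2, 3, 4, 5, 6])
y = np.array([90, 45, 9, 88, 21, 59, 32])
data = np.vstack([x, y])                          # shape (variable, position)

# One sort order per variable: orders[0] sorts by x, orders[1] sorts by y.
orders = np.argsort(data, axis=1, kind="stable")

# data[:, orders] has shape (variable, ordering, position);
# swap to (ordering, variable, position) and flatten the first two axes.
by_order = data[:, orders].swapaxes(0, 1)
df_2 = pd.DataFrame(
    by_order.reshape(-1, data.shape[1]),
    index=pd.MultiIndex.from_product([["x_srtd", "y_srtd"], ["x", "y"]]),
)

# Fixed-frequency bin labels by position, e.g. 7 columns -> sizes [3, 2, 2].
n_bins = 3
base, extra = divmod(data.shape[1], n_bins)
sizes = np.full(n_bins, base)
sizes[:extra] += 1
bin_id = np.repeat(np.arange(n_bins), sizes)

# Group the transposed frame by bin label, then transpose back.
medians = df_2.T.groupby(bin_id).median().T

print(df_2)
print(medians)
```

Because both orderings live in one frame, a single groupby covers the x-sorted and y-sorted statistics together. For fixed-width bins the labels would differ between the two orderings, so that case would likely need one label array per ordering.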
