
Suppose we have a categorical column, "model", and a numeric column, "sensor", holding the representative percentage obtained for each row:

import pandas as pd
import numpy as np

data = {'model':  ['a','b','b','a','b','b','a','b','b','a','b','b'],
        'sensor': [0.34, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3]
       }

this makes:

+-----+-------+--------+
| idx | model | sensor |
+-----+-------+--------+
|  1  |   a   |  0.34  |
|  2  |   b   |  0.20  |
|  3  |   b   |  0.30  |
|  4  |   a   |  0.10  |
|  5  |   b   |  0.20  |
|  6  |   b   |  0.30  |
|  7  |   a   |  0.10  |
|  8  |   b   |  0.20  |
|  9  |   b   |  0.30  |
| 10  |   a   |  0.10  |
| 11  |   b   |  0.20  |
| 12  |   b   |  0.30  |
+-----+-------+--------+

therefore: n=12 / na=4 / nb=8
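Those group sizes can be checked quickly with `value_counts` (a minimal sketch that rebuilds the frame from the dict above):

```python
import pandas as pd

# rebuild the example frame from the question
data = {'model':  ['a','b','b','a','b','b','a','b','b','a','b','b'],
        'sensor': [0.34, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3]}
df = pd.DataFrame(data)

# n = total rows, na/nb = rows per model
counts = df['model'].value_counts()
print(len(df), counts['a'], counts['b'])   # 12 4 8
```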

Is it possible to use a different set of bins to count/sum/aggregate based on the model column?

bin = {'a':[0, 0.1, 0.12, 0.2, 0.33, 1],
       'b':[0, 0.08, 0.15, 0.24, 1]  # <- note that the two bin lists have different edges and lengths
       }

df = pd.DataFrame(data=data, columns=['model', 'sensor'])

In the research, i stumbled into this code (references below) and tried my own:

for key, value in bin.items():
    # create labels based on the number of bin edges;
    # transposing the cases would be an alternative
    labels = ['N{}'.format(x) for x in range(0, len(value))]

    # label every sensor value against this model's bin edges
    # (all rows -- the filter on the model column is still missing)
    labeled_bin = np.array(labels)[np.array(value).searchsorted(df['sensor'].values)]

and this works partially: the output is a list of labeled data, but without any filter on the model column. I understand what I should do, but mechanically/syntactically it does not seem possible.
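One way to complete that loop (a sketch, not the original code: the `labeled` dict and the boolean `mask` are names introduced here) is to select each model's rows with a boolean mask before calling `searchsorted`:

```python
import numpy as np
import pandas as pd

data = {'model':  ['a','b','b','a','b','b','a','b','b','a','b','b'],
        'sensor': [0.34, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3]}
df = pd.DataFrame(data)

bins = {'a': [0, 0.1, 0.12, 0.2, 0.33, 1],
        'b': [0, 0.08, 0.15, 0.24, 1]}

labeled = {}
for key, edges in bins.items():
    mask = df['model'] == key                    # keep only this model's rows
    values = df.loc[mask, 'sensor'].to_numpy()
    labels = ['N{}'.format(x) for x in range(len(edges))]
    # searchsorted returns the insertion index of each value,
    # i.e. which interval between edges it falls into
    labeled[key] = np.array(labels)[np.asarray(edges).searchsorted(values)]

print(labeled['a'])
```

Note that `searchsorted` with its default `side='left'` treats a value equal to an edge as belonging to the bin above it, which differs from `pd.cut`'s default right-closed intervals.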

As an answer, I was aiming for something like this:

+------+------+-------+-------+--------+--------+
|      | n1   | n2    | n3    | n4     | n5     | <-- either named interval
+------+------+-------+-------+--------+--------+
|      | 0-10 | 10-12 | 12-20 | 20-33  | 33-100 | <-- or inferior / superior limits
+------+------+-------+-------+--------+--------+
| a    |  0   |     3 |   0   |   0    |   1    | <-- count and/or
+------+------+-------+-------+--------+--------+
| a    |  0   |  0.75 |   0   |   0    |  0.25  | <-- interval relative distribution
+------+------+-------+-------+--------+--------+

+------+------+---------+---------+--------+
|      |0-0.08|0.08-0.15|0.15-0.24|0.24-1  |
+------+------+---------+---------+--------+
| b    |   0  |    0    |     4   |    4   |
+------+------+---------+---------+--------+
| b    |   0  |    0    |    0.5  |   0.5  |
+------+------+---------+---------+--------+

references:

https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#data-munging

How to categorize data based on column values in pandas?

Count frequency of values in pandas DataFrame column

Pandas groupby with bin counts

dormanino

1 Answer

Assuming that you are only working with two models, you can use pd.cut. If there are more, you will probably want to write a function so you do not have to spell out each model's group by hand.

# create labels
labels = {'a': ['0-10', '10-12', '12-20', '20-33', '33-100'],
          'b': ['0-0.08', '0.08-0.15', '0.15-0.24', '0.24-1']}

# create bins
bins = {'a':[0, 0.1, 0.12, 0.2, 0.33, 1],
        'b':[0, 0.08, 0.15, 0.24, 1]
       }

# get_dummies and cut where model == a
a = df[df['model'] == 'a'][['model']].join(pd.get_dummies(pd.cut(df[df['model'] == 'a']['sensor'],
                                                                 bins['a'], labels=labels['a'])))
# get the counts for each bin
new_a = a.groupby('model').sum()

# stack the counts on top of the per-bin percentages
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
final_a = pd.concat([new_a, new_a.apply(lambda x: x / x.sum(), axis=1)])


# get_dummies and cut where model == b
b = df[df['model'] == 'b'][['model']].join(pd.get_dummies(pd.cut(df[df['model'] == 'b']['sensor'],
                                                                 bins['b'], labels=labels['b'])))
# get the counts for each bin
new_b = b.groupby('model').sum()

# stack the counts on top of the per-bin percentages
final_b = pd.concat([new_b, new_b.apply(lambda x: x / x.sum(), axis=1)])


       0-10  10-12  12-20  20-33  33-100
model                                   
a      3.00    0.0    0.0    0.0    1.00
a      0.75    0.0    0.0    0.0    0.25

##################################################

       0-0.08  0.08-0.15  0.15-0.24  0.24-1
model                                      
b         0.0        0.0        4.0     4.0
b         0.0        0.0        0.5     0.5

A function could look something like this, though it probably won't be very fast on large datasets:

def create_bins(df, labels, bins):
    models = df['model'].unique().tolist()

    dfs = {}

    for m in models:
        # one-hot encode the binned sensor values for this model
        df_bin = df[df['model'] == m][['model']].join(pd.get_dummies(pd.cut(df[df['model'] == m]['sensor'],
                                                                            bins[m], labels=labels[m])))

        # counts per bin
        new_df = df_bin.groupby('model').sum()

        # stack the counts on top of the per-bin percentages
        final_df = pd.concat([new_df, new_df.apply(lambda x: x / x.sum(), axis=1)])

        dfs[m] = final_df

    return dfs


dfs = create_bins(df, labels, bins)

You can then look up the result by model name, e.g. dfs['a']:

print(dfs['a'])

       0-10  10-12  12-20  20-33  33-100
model                                   
a      3.00    0.0    0.0    0.0    1.00
a      0.75    0.0    0.0    0.0    0.25
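For reference, a similar per-model result can also be sketched with pd.cut plus value_counts, which skips get_dummies entirely (`bin_model` is a helper name introduced here, not part of the answer above; labels and bins are the same dicts):

```python
import pandas as pd

data = {'model':  ['a','b','b','a','b','b','a','b','b','a','b','b'],
        'sensor': [0.34, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3]}
df = pd.DataFrame(data)

labels = {'a': ['0-10', '10-12', '12-20', '20-33', '33-100'],
          'b': ['0-0.08', '0.08-0.15', '0.15-0.24', '0.24-1']}
bins = {'a': [0, 0.1, 0.12, 0.2, 0.33, 1],
        'b': [0, 0.08, 0.15, 0.24, 1]}

def bin_model(df, m):
    # cut this model's sensor values into its own bins
    cut = pd.cut(df.loc[df['model'] == m, 'sensor'], bins[m], labels=labels[m])
    # count per bin, restoring the original bin order
    counts = cut.value_counts().reindex(labels[m])
    # one row of counts, one row of relative frequencies
    return pd.DataFrame({'count': counts, 'pct': counts / counts.sum()}).T

print(bin_model(df, 'a'))
```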
It_is_Chris
  • Hey @Chris, just in case I needed the models grouped (w.o. the cut) at the same df, with one kind of output i.e the relative amount by interval or the count...is that also possible? – dormanino Nov 17 '19 at 19:46
  • Yep, I was arriving on it without pandas...but your answer solves it. Thanks – dormanino Nov 17 '19 at 19:50