If we have a column with categorical data as seen in "model" and it's representative percentage obtained by a "sensor", also in a columnar numerical data,
import pandas as pd
import numpy as np
data = {'model': ['a','b','b','a','b','b','a','b','b','a','b','b'],
'sensor': [0.34, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3]
}
this makes:
+----+-----+------+
|idx |model|sensor|
| 1 | a | 0.34 |
| 2 | b | 0.20 |
| 3 | b | 0.30 |
| 4 | a | 0.10 |
| 5 | b | 0.20 |
| 6 | b | 0.30 |
| 7 | a | 0.10 |
| 8 | b | 0.20 |
| 9 | b | 0.30 |
| 10 | a | 0.10 |
| 11 | b | 0.20 |
| 12 | b | 0.30 |
+----+-----+------+
therefore: n=12 / na=4 / nb=8
is it possible to use a different bin to count/sum/aggregate based o the model column?
bin = {'a':[0, 0.1, 0.12, 0.2, 0.33, 1],
'b':[0, 0.08, 0.15, 0.24, 1] # <- please note that the bins have different min/max amounts
}
df = pd.DataFrame(data=data, columns=['model', 'sensor'])
In the research, i stumbled into this code (references below) and tried my own:
for key, value in bin.items():
# create labels based on the count of the bins,
# case transposition is a alternative
labels=['N{}'.format(x) for x in range(0, len(value))]
# create list/array of labeled result for each
# sensor data based on the passed bin list/array
labeled_bin = np.array(labels)[np.array(value).searchsorted(df['sensor'].values)]
and this works partially. The output is, clearly, a list of labeled data without the filter based on the model. I understand what I should do but mechanistically/syntactically this seems not possible.
As an answer, I was aiming for something like this:
+------+------+-------+-------+--------+--------+
| | n1 | n2 | n3 | n4 | n5 | <-- either named interval
+------+------+-------+-------+--------+--------+
| | 0-10 | 10-12 | 12-20 | 20-33 | 33-100 | <-- or inferior / superior limits
+------+------+-------+-------+--------+--------+
| a | 0 | 3 | 0 | 0 | 1 | <-- count and/or
+------+------+-------+-------+--------+--------+
| a | 0 | 0.75 | 0 | 0 | 0.25 | <-- interval relative distribution
+------+------+-------+-------+--------+--------+
+------+------+---------+---------+--------+
| |0-0.08|0.08-0.15|0.15-0.24|0.24-1 |
+------+------+---------+---------+--------+
| b | 0 | 0 | 4 | 4 |
+------+------+---------+---------+--------+
| b | 0 | 0 | 0.5 | 0.5 |
+------+------+---------+---------+--------+
references:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#data-munging
How to categorize data based on column values in pandas?