Creating a function in Python for creating buckets from pandas dataframe values based on multiple conditions

Question

I asked this question and it helped me, but now my task is more complex.

My dataframe has ~100 columns and values with 14 scales.

{'Diseasetype': {0: 'Oncology',
 1: 'Oncology',
 2: 'Oncology',
 3: 'Nononcology',
 4: 'Nononcology',
 5: 'Nononcology'},
'Procedures1': {0: 100, 1: 300, 2: 500, 3: 200, 4: 400, 5: 1000},
'Procedures2': {0: 1, 1: 3, 2: 5, 3: 2, 4: 4, 5: 10},
'Procedures100': {0: 1000, 1: 3000, 2: 5000, 3: 2000, 4: 4000, 5: 10000}}

I want to convert each value in each column of the dataframe into a bucket value.

My current solution is:

def encoding(col, labels):
    return np.select([col<200, col.between(200,500), col.between(500,1000), col>1000], labels, 0)

onc_labels = [1,2,3,4]
nonc_labels = [11,22,33,44]
msk = df['Therapy_area'] == 'Oncology'

df[cols] = pd.concat((df.loc[msk, cols].apply(encoding, args=(onc_labels,)), df.loc[msk, cols].apply(encoding, args=(nonc_labels,)))).reset_index(drop=True)

It works well, if all columns of the dataframe has the same scale, but they do not. Remember, I have 14 different scales.

I would like to update the code above (or get another solution), which would allow me to bucket data. I cannot use the same range of values for bucketing everything.

My logic is the following:

If Disease == Oncology and Procedures1 on this scale, convert values to these buckets (1, 2, 3)

If Disease == Oncology and Procedures2 on this scale, convert values to these buckets (1, 2, 3)

If Disease != Oncology and Procedures77 on this scale, convert values to these buckets (4, 5, 6)

Example of a scale and buckets:

Procedures1 for Oncology: < 200 = 1, 200-400 = 2, >400 = 3

Procedures2 for Oncology: < 2 = 1, 2-4 = 2, >4 = 3

Procedures3 for Oncology: < 2000 = 1, 2000-4000 = 2, >4000 = 3

Procedures1 for nonOncology: < 200 = 4, 200-400 = 5, >400 = 6

Procedures2 for nonOncology: < 2 = 4, 2-4 = 5, >4 = 6

Procedures3 for nonOncology: < 2000 = 4, 2000-4000 = 5, >4000 = 6

Expected output (happy to provide more info!)

Diseasetype Procedures1 Procedures2 Procedures100
Oncology        1             1           1
Oncology        2             2           2
Oncology        3             3           3
Nononcology     4             4           4
Nononcology     5             5           5
Nononcology     6             6           6

Link with rules:

All scales are the same for Oncology and nonOncology, they just depend on Procedure? — Corralien, Feb 25 '22 at 13:14
Except if you have a mathematical relation, I think you have to create a file with all combinations — Corralien, Feb 25 '22 at 13:21
@Corralien, scales are different for Oncology and Non-Oncology. That is why I liked the first simple solution. It create a list of columns and tell which scale to apply for this Procedure. So with some manual tuning I would be able to apply different scales. Any idea on how to implement this? — Anakin Skywalker, Feb 25 '22 at 13:33
I can create some simple template. I have 14 scales for Oncology and 10 scales for Non-Oncology to be precise. — Anakin Skywalker, Feb 25 '22 at 13:34

score 1 · Accepted Answer · answered Feb 25 '22 at 14:14

I used an helper file with all scales (source at the end of answer):

Use melt to flatten your dataframe then filter out your rows with query and finally use pivot to reshape your dataframe. You can execute each line independently to show the transformations:

scales = pd.read_csv('scales.csv').fillna({'Start': -np.inf, 'End': np.inf})
out = (
  df.melt('Diseasetype', var_name='Procedure', ignore_index=False).reset_index()
    .merge(scales, on=['Diseasetype', 'Procedure'], how='left')
    .query("value.between(Start, End)")
    .pivot_table('Label', ['index', 'Diseasetype'], 'Procedure').astype(int)
    .droplevel(0).rename_axis(columns=None).reset_index()
)

Output:

>>> df
   Diseasetype  Procedures1  Procedures100  Procedures2
0     Oncology            1              1            1
1     Oncology            2              2            2
2     Oncology            3              3            3
3  Nononcology            4              4            4
4  Nononcology            5              5            5
5  Nononcology            6              6            6

Content of scales.csv:

Diseasetype,Procedure,Start,End,Label
Oncology,Procedures1,,200,1
Oncology,Procedures1,200,400,2
Oncology,Procedures1,400,,3
Oncology,Procedures2,,2,1
Oncology,Procedures2,2,4,2
Oncology,Procedures2,4,,3
Oncology,Procedures100,,2000,1
Oncology,Procedures100,2000,4000,2
Oncology,Procedures100,4000,,3
Nononcology,Procedures1,,200,4
Nononcology,Procedures1,200,400,5
Nononcology,Procedures1,400,,6
Nononcology,Procedures2,,2,4
Nononcology,Procedures2,2,4,5
Nononcology,Procedures2,4,,6
Nononcology,Procedures100,,2000,4
Nononcology,Procedures100,2000,4000,5
Nononcology,Procedures100,4000,,6

Here, I don't search any magical solution to your problem but think about standardize values between 0 and 1. Use probably: `pd.cut`, `np.quantile` or `np.linspace` to set labels. — Corralien, Feb 25 '22 at 14:23
thanks for putting so much time! I attached a sample file. Can you please look at it? — Anakin Skywalker, Feb 25 '22 at 15:07
if possible, I would greatly appreciate it. I store data like this, it would help me a lot! — Anakin Skywalker, Feb 25 '22 at 20:21
No. I recreated the structure of the table + gave example of ranges. Ranges can be different, I gave an example. If you can put something `in range(a, b)`, where I can change the input of the range manually, would be fantastic. — Anakin Skywalker, Feb 25 '22 at 20:39

Creating a function in Python for creating buckets from pandas dataframe values based on multiple conditions

1 Answers1