0

I asked this question and it helped me, but now my task is more complex.

My dataframe has ~100 columns and values with 14 scales.

{'Diseasetype': {0: 'Oncology',
 1: 'Oncology',
 2: 'Oncology',
 3: 'Nononcology',
 4: 'Nononcology',
 5: 'Nononcology'},
'Procedures1': {0: 100, 1: 300, 2: 500, 3: 200, 4: 400, 5: 1000},
'Procedures2': {0: 1, 1: 3, 2: 5, 3: 2, 4: 4, 5: 10},
'Procedures100': {0: 1000, 1: 3000, 2: 5000, 3: 2000, 4: 4000, 5: 10000}}

I want to convert each value in each column of the dataframe into a bucket value.

My current solution is:

def encoding(col, labels):
    return np.select([col<200, col.between(200,500), col.between(500,1000), col>1000], labels, 0)

onc_labels = [1,2,3,4]
nonc_labels = [11,22,33,44]
msk = df['Therapy_area'] == 'Oncology'

df[cols] = pd.concat((df.loc[msk, cols].apply(encoding, args=(onc_labels,)), df.loc[msk, cols].apply(encoding, args=(nonc_labels,)))).reset_index(drop=True)

It works well, if all columns of the dataframe has the same scale, but they do not. Remember, I have 14 different scales.

I would like to update the code above (or get another solution), which would allow me to bucket data. I cannot use the same range of values for bucketing everything.

My logic is the following:

If Disease == Oncology and Procedures1 on this scale, convert values to these buckets (1, 2, 3)

If Disease == Oncology and Procedures2 on this scale, convert values to these buckets (1, 2, 3)

If Disease != Oncology and Procedures77 on this scale, convert values to these buckets (4, 5, 6)

Example of a scale and buckets:

Procedures1 for Oncology: < 200 = 1, 200-400 = 2, >400 = 3

Procedures2 for Oncology: < 2 = 1, 2-4 = 2, >4 = 3

Procedures3 for Oncology: < 2000 = 1, 2000-4000 = 2, >4000 = 3

Procedures1 for nonOncology: < 200 = 4, 200-400 = 5, >400 = 6

Procedures2 for nonOncology: < 2 = 4, 2-4 = 5, >4 = 6

Procedures3 for nonOncology: < 2000 = 4, 2000-4000 = 5, >4000 = 6

Expected output (happy to provide more info!)

Diseasetype Procedures1 Procedures2 Procedures100
Oncology        1             1           1
Oncology        2             2           2
Oncology        3             3           3
Nononcology     4             4           4
Nononcology     5             5           5
Nononcology     6             6           6

Link with rules:

Anakin Skywalker
  • 2,400
  • 5
  • 35
  • 63
  • 1
    All scales are the same for Oncology and nonOncology, they just depend on Procedure? – Corralien Feb 25 '22 at 13:14
  • 1
    Except if you have a mathematical relation, I think you have to create a file with all combinations – Corralien Feb 25 '22 at 13:21
  • @Corralien, scales are different for Oncology and Non-Oncology. That is why I liked the first simple solution. It create a list of columns and tell which scale to apply for this Procedure. So with some manual tuning I would be able to apply different scales. Any idea on how to implement this? – Anakin Skywalker Feb 25 '22 at 13:33
  • I can create some simple template. I have 14 scales for Oncology and 10 scales for Non-Oncology to be precise. – Anakin Skywalker Feb 25 '22 at 13:34

1 Answers1

1

I used an helper file with all scales (source at the end of answer):

scales

Use melt to flatten your dataframe then filter out your rows with query and finally use pivot to reshape your dataframe. You can execute each line independently to show the transformations:

scales = pd.read_csv('scales.csv').fillna({'Start': -np.inf, 'End': np.inf})
out = (
  df.melt('Diseasetype', var_name='Procedure', ignore_index=False).reset_index()
    .merge(scales, on=['Diseasetype', 'Procedure'], how='left')
    .query("value.between(Start, End)")
    .pivot_table('Label', ['index', 'Diseasetype'], 'Procedure').astype(int)
    .droplevel(0).rename_axis(columns=None).reset_index()
)

Output:

>>> df
   Diseasetype  Procedures1  Procedures100  Procedures2
0     Oncology            1              1            1
1     Oncology            2              2            2
2     Oncology            3              3            3
3  Nononcology            4              4            4
4  Nononcology            5              5            5
5  Nononcology            6              6            6

Content of scales.csv:

Diseasetype,Procedure,Start,End,Label
Oncology,Procedures1,,200,1
Oncology,Procedures1,200,400,2
Oncology,Procedures1,400,,3
Oncology,Procedures2,,2,1
Oncology,Procedures2,2,4,2
Oncology,Procedures2,4,,3
Oncology,Procedures100,,2000,1
Oncology,Procedures100,2000,4000,2
Oncology,Procedures100,4000,,3
Nononcology,Procedures1,,200,4
Nononcology,Procedures1,200,400,5
Nononcology,Procedures1,400,,6
Nononcology,Procedures2,,2,4
Nononcology,Procedures2,2,4,5
Nononcology,Procedures2,4,,6
Nononcology,Procedures100,,2000,4
Nononcology,Procedures100,2000,4000,5
Nononcology,Procedures100,4000,,6
Corralien
  • 109,409
  • 8
  • 28
  • 52