0

I have a DataFrame with dozens of columns.

Therapy area    Procedures1 Procedures2 Procedures3
Oncology        450         450         2345
Oncology        367         367         415
Oncology        152         152         4945
Oncology        876         876         345
Oncology        1098        1098        12
Oncology        1348        1348        234
Nononcology     225         225         345
Nononcology     300         300         44
Nononcology     267         267         45
Nononcology     90          90          4567

I want to change numeric values in all Procedure columns into buckets.

For one column it will be something like

def hello(x):
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] < 200: return int(1)
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] in range (200, 500): return 2
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] in range (500, 1000): return 3
    if x['Therapy area'] == 'Oncology' and x['Procedures1'] > 1000: return 4
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] < 200: return 11
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] in range (200, 500): return 22
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] in range (500, 1000): return 33
    if x['Therapy area'] != 'Oncology' and x['Procedures1'] > 1000: return 44  
test['Procedures1'] = test.apply(hello, axis=1)

What is the most efficient way to apply this for dozens of columns with different column names (not Procedures1, Procedures2, 'Procedures3` etc)?

When using cut with specific bins I get the error:

ValueError: bins must increase monotonically.

Since I can have different values how can I solve this with logical operations, not bins?

Also the values can be different depending on the "Therapy area" column, like 11, 22, 33, 44 for Nononcology and 1, 2, 3, 4 for Oncology.

Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
Anakin Skywalker
  • 2,400
  • 5
  • 35
  • 63

1 Answers1

2

You could apply pd.cut to the relevant columns:

cols = ['Procedures1', 'Procedures2']
df[cols] = df[cols].apply(lambda col: pd.cut(col, [0,200,500,1000, col.max()], labels=[1,2,3,4]))

Output:

  Therapy_area Procedures1 Procedures2
0     Oncology           2           2
1     Oncology           2           2
2     Oncology           1           1
3     Oncology           3           3
4     Oncology           4           4
5     Oncology           4           4
6  Nononcology           2           2
7  Nononcology           2           2
8  Nononcology           2           2
9  Nononcology           1           1

You could also use np.select:

def encoding(col, labels):
    return np.select([col<200, col.between(200,500), col.between(500,1000), col>1000], labels, 0)

onc_labels = [1,2,3,4]
nonc_labels = [11,22,33,44]
msk = df['Therapy_area'] == 'Oncology'

df[cols] = pd.concat((df.loc[msk, cols].apply(encoding, args=(onc_labels,)), df.loc[msk, cols].apply(encoding, args=(nonc_labels,)))).reset_index(drop=True)

Output:

  Therapy_area  Procedures1  Procedures2  Procedures3
0     Oncology            2            2            4
1     Oncology            2            2            2
2     Oncology            1            1            4
3     Oncology            3            3            2
4     Oncology            4            4            1
5     Oncology            4            4            2
6  Nononcology           22           22           44
7  Nononcology           22           22           22
8  Nononcology           11           11           44
9  Nononcology           33           33           22