I am working with a DataFrame that has a column of codes for cancer diseases. The codes are strings consisting of either 5 numbers or "DC" followed by 2-3 numbers. I want to create a new variable (cancer_type) that takes the value from the disease code variable (cancer_code) and assigns it a category (a value from 1 to 12, for example).

It should be something like this:

# pseudo-code
if df[cancer_code] == ("1400-1499" or "DC00-DC148") -> df[cancer_group] = 1
if df[cancer_code] == ("1500-1599" or "DC150-159") -> df[cancer_group] = 2

I have found many examples of how to use conditions on variables of integers/floats, but none on a "range" of strings. Is there any easy way to do this? I am using pandas.

You explained in text what you have. Please read How to Ask and prepare a minimal reproducible example (see [good-reproducible-pandas-examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)) and show the code you tried. – Patrick Artner Jun 30 '21 at 06:20

2 Answers

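You can create a mapping dict and then use it to map the values. Note that `map` matches keys exactly, so this works only if the column holds these literal strings; any value not found in the dict becomes NaN.

map_dict = {"1400-1499": 1, "DC00-DC148": 1, "1500-1599": 2, "DC150-159": 2}
df["cancer_group"] = df["cancer_code"].map(map_dict)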
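
If the column holds individual codes such as "1450" or "DC12" rather than the range labels themselves, one option is to split each code into its prefix and numeric part and test the number against each range. The following is a minimal sketch under assumptions: the sample codes are made up, "DC150-159" is read as DC150 to DC159, and 0 is used as an arbitrary "no match" value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample codes; the real column may be formatted differently.
df = pd.DataFrame(
    {"cancer_code": ["1400", "1499", "DC00", "DC148", "1500", "DC155", "9999"]}
)

# Split each code into an optional "DC" prefix and its numeric part.
parts = df["cancer_code"].str.extract(r"^(?P<prefix>DC)?(?P<number>\d+)$")
number = pd.to_numeric(parts["number"])
is_dc = parts["prefix"].eq("DC")

# One boolean condition per range from the question, with its group number.
conditions = [
    ~is_dc & number.between(1400, 1499),
    is_dc & number.between(0, 148),
    ~is_dc & number.between(1500, 1599),
    is_dc & number.between(150, 159),
]
groups = [1, 1, 2, 2]

# The first matching condition wins; codes matching no range get 0 (assumed).
df["cancer_group"] = np.select(conditions, groups, default=0)

print(df)
```

`Series.between` is inclusive on both ends by default, so the boundary codes 1400 and 1499 both fall into group 1.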
Nk03

Alternate way to do it yourself:

You can build a mapping from the unique values of the column you are interested in, apply it to that column, and store the result as a new column:

import pandas as pd
from random import choice
data = ["One", "Two", "Fourty-Two oder More", "Not Categorized"]

# random demo data
df = pd.DataFrame({ "DataPoint": [f"Patient_{i:03}" for i in range(30)],
                    "Category": [choice(data) for _ in range(30)]})

# create a mapper dict automatically from the unique values of the column
# you can fine-tune it by supplying your own fixed mapping instead
mapper = {k: idx for idx, k in enumerate(df.Category.unique())}

# apply the mapper and save the result as a new column
df["mapped"] = df["Category"].apply(mapper.get)

print(df)

Output:

      DataPoint              Category  mapped
0   Patient_000                   One       0
1   Patient_001       Not Categorized       1
2   Patient_002       Not Categorized       1
3   Patient_003                   Two       2
4   Patient_004  Fourty-Two oder More       3
..       ...              ...              ...
26  Patient_026                   One       0
27  Patient_027       Not Categorized       1
28  Patient_028                   Two       2
29  Patient_029                   One       0
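
The hand-built mapper numbers the categories in order of first appearance. As a hedged aside, `pd.factorize` produces the same kind of numbering in a single call; the small frame below is made-up demo data, not the random data from the answer.

```python
import pandas as pd

# Hypothetical demo data with a fixed order so the result is reproducible.
df = pd.DataFrame(
    {"Category": ["One", "Not Categorized", "Not Categorized", "Two", "One"]}
)

# factorize returns integer codes plus the unique values they refer to;
# codes follow the order in which each value first appears.
codes, uniques = pd.factorize(df["Category"])
df["mapped"] = codes

print(df)
print(list(uniques))
```

`uniques[code]` recovers the original label for a given code, which is handy when the numbers have to be translated back later.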

Let pandas do it for you:

You can declare your column as categorical (answer attribution) and let pandas do the rest:

df = pd.DataFrame({ "DataPoint": [f"Patient_{i:03}" for i in range(30)],
                    "Category": [choice(data) for _ in range(30)]})

df.Category = pd.Categorical(df.Category)
df["NumericalCat"] = df.Category.cat.codes

print(df)

Output:

      DataPoint              Category  NumericalCat
0   Patient_000                   One             2
1   Patient_001  Fourty-Two oder More             0
2   Patient_002       Not Categorized             1
3   Patient_003                   Two             3
4   Patient_004                   One             2
..       ...              ...                    ...
26  Patient_026  Fourty-Two oder More             0
27  Patient_027                   Two             3
28  Patient_028                   Two             3
29  Patient_029  Fourty-Two oder More             0
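
Without further arguments, `pd.Categorical` sorts the inferred categories, which is why "Fourty-Two oder More" receives code 0 in the output. If specific numbers are needed, the order can be passed explicitly. A minimal sketch, assuming a made-up category order and demo data:

```python
import pandas as pd

# Hypothetical demo data; "Unknown" is deliberately missing from the order list.
df = pd.DataFrame(
    {"Category": ["Two", "One", "Not Categorized", "One", "Unknown"]}
)

# Assumed order: the position in this list becomes the code.
order = ["One", "Two", "Not Categorized"]
cat = pd.Categorical(df["Category"], categories=order)

# Values that are not in `order` get the code -1.
df["NumericalCat"] = cat.codes

print(df)
```

Adding 1 to the codes would give 1-based group numbers, at the cost of turning the -1 marker for unlisted values into 0.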
Patrick Artner