1

I have a dataframe that looks like this

Name Code Amount Seller
Sildenafil 1045 VGRA 32 Rite Aid
Ibuprofen 378 ADVL 209 CVS
Paracetamol PCTML 87 Keystone
Aspirin 9852 DISP 372 Rite Aid
Ibuprofen 1992 MOTR 87 Walgreens
Benadryl BDRL 120 CVS
Aspirin 0541 SPRN 197 H Mart
Sildenafil 002 VIAG 12 Omnicare
Ibuprofen 378 ADVL 301 Keystone
Paracetamol PCTML 673 Walgreens
Ibuprofen 1992 MOTR 87 Omnicare
Sildenafil 1045 VGRA 45 H Mart
Benadryl BDRL 111 Keystone
Aspirin 9852 DISP 285 CVS
Sildenafil 002 VIAG 79 Rite Aid
Aspirin 0541 SPRN 431 Omnicare

where df.Code.unique() = [VGRA, ADVL, PCTML, DISP, MOTR, BDRL, SPRN, VIAG]

I want to identify the medicines of the same 'Name' (but different ending numbers) with the same 'Code'

Ex. 'Ibuprofen 378' and 'Ibuprofen 1992' have two different codes- 'ADVL' and 'MOTR' respectively. I want to replace all instances of 'MOTR' with 'ADVL'. Same for Sildenafil and Aspirin and any other drugs with the same coding situation. The output should look like this:

Name Code Amount Seller
Sildenafil 1045 VGRA 32 Rite Aid
Ibuprofen 378 ADVL 209 CVS
Paracetamol PCTML 87 Keystone
Aspirin 9852 DISP 372 Rite Aid
Ibuprofen 1992 ADVL 87 Walgreens
Benadryl BDRL 120 CVS
Aspirin 0541 DISP 197 H Mart
Sildenafil 002 VGRA 12 Omnicare
Ibuprofen 378 ADVL 301 Keystone
Paracetamol PCTML 673 Walgreens
Ibuprofen 1992 ADVL 87 Omnicare
Sildenafil 1045 VGRA 45 H Mart
Benadryl BDRL 111 Keystone
Aspirin 9852 DISP 285 CVS
Sildenafil 002 VGRA 79 Rite Aid
Aspirin 0541 DISP 431 Omnicare

where df.Code.unique() = [VGRA, ADVL, PCTML, DISP, BDRL]

I cannot figure out how to do this, the only thing I've managed is generating a third variable using

df.Name_Code = df.Name.astype(str).str[:5]

Name Code Amount Seller Name_Code
Sildenafil 1045 VGRA 32 Rite Aid SILDE
Ibuprofen 378 ADVL 209 CVS IBUPR
Paracetamol PCTML 87 Keystone PARAC
Aspirin 9852 DISP 372 Rite Aid ASPIR
Ibuprofen 1992 ADVL 87 Walgreens IBUPR
Benadryl BDRL 120 CVS BENAD
Aspirin 0541 DISP 197 H Mart ASPIR
Sildenafil 002 VGRA 12 Omnicare SILDE
Ibuprofen 378 ADVL 301 Keystone IBUPR
Paracetamol PCTML 673 Walgreens PARAC
Ibuprofen 1992 ADVL 87 Omnicare IBUPR
Sildenafil 1045 VGRA 45 H Mart SILDE
Benadryl BDRL 111 Keystone BENAD
Aspirin 9852 DISP 285 CVS ASPIR
Sildenafil 002 VGRA 79 Rite Aid SILDE
Aspirin 0541 DISP 431 Omnicare ASPIR

And while that does uniformize the drugs, it's less helpful. Reading "VGRA" for ex. would be more helpful than reading "SILDE". How can I modify this?

2 Answers2

0

If I understand you correctly, you want to set the Code to the first observed Code value for each medicine:

df['Code'] = df.groupby(df['Name'].str.split().str[0])['Code'].transform('first')
print(df)

Prints:

               Name   Code  Amount     Seller
0   Sildenafil 1045   VGRA      32   Rite Aid
1     Ibuprofen 378   ADVL     209        CVS
2       Paracetamol  PCTML      87   Keystone
3      Aspirin 9852   DISP     372   Rite Aid
4    Ibuprofen 1992   ADVL      87  Walgreens
5          Benadryl   BDRL     120        CVS
6      Aspirin 0541   DISP     197     H Mart
7    Sildenafil 002   VGRA      12   Omnicare
8     Ibuprofen 378   ADVL     301   Keystone
9       Paracetamol  PCTML     673  Walgreens
10   Ibuprofen 1992   ADVL      87   Omnicare
11  Sildenafil 1045   VGRA      45     H Mart
12         Benadryl   BDRL     111   Keystone
13     Aspirin 9852   DISP     285        CVS
14   Sildenafil 002   VGRA      79   Rite Aid
15     Aspirin 0541   DISP     431   Omnicare
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
0

This question shares some similarities with another I found, so I'd definitely recommend searching the archives.

Order-Agnostic Approach

I say order-agnostic because the other answer is ideal if you're looking to just use the 'first' value.

It seems like you're going to need a dict/hashmap that serves as your source of truth for re-defining the Code based on the Name. Given what you wrote in the question it would look something like this:

name_to_code_mapping = {"Ibuprofen": "ADVL", "Sildenafil": "...", ...}

You can then use the dict inside of a function that you can pass to something like df.apply:

def change_code_based_on_drug_name(df_row: pd.Series, column_to_find: str, column_to_change: str, map: dict):
    val_to_find = df_row[column_to_find]
    ## You'll have to split the numbers off of the name here
    trimmed_df_value = val_to_find.split(" ")[0]
    
    if trimmed_df_value in map:
        df_row[column_to_change] = map[trimmed_df_value]
    return df_row

And feed it to that df.apply:

df = df.apply(lambda row: change_code_based_on_drug_name(row, "Name", "Code", name_to_code_mapping), axis=1)

Alternative, Less Ideal Approach

You could also iterate over the map and use .loc[] (like in the above S/O post) to change the values but you would need to create a column of the name minus numbers:

## Notice that since I'm using apply over a single column (i.e. Series) here I don't need to axis paramter
df["Name_without_numbers"] = df["Name"].apply(lambda name: name.split(" ")[0])

You'd then set that to the index so you can use the .loc method.

df.set_index("Name_without_numbers", inplace=True)

for key, val in name_to_code_mapping.items():
    df.loc[key, "Code"] = val

I don't like or recommend that second approach because:

  1. It adds an unnecessary column, and
  2. Creates a non-unique index

Hope that helps!

jaellis
  • 18
  • 2