How to compare two columns of diffrent dataframes and create a new one

Question

Hello guys i am new to python.I have two dataframes.One contains the description of medicines which looks like this :

df1.head(5)

PID  Drug_Admin_Description
1       sodium chloride 0.9% SOLN
2       Nimodipine 30 mg oral
3       Livothirine 20 mg oral
4       Livo tab 112
5       Omega-3 Fatty Acids

Other table has only drug names, which looks like this :

df2.head(5)

Drug_Name 

Sodium chloride 0.5% SOLN
omega-3 Fatty Acids
gentamicin 40 mg/ml soln
amoxilin 123
abcd 12654

Is there a way i can extract only those medicines which are in both df1 and df2.The sample output would look like this :

new_column

Sodium chloride
omega-3

I tried using regex in python but not able to figure out how would i apply this.Thanks in advance

Logically, how do you identify a drug name from the rest of the string? — ALollz, Oct 07 '19 at 16:20
Yes you are right that would be a problem.But for now anything which is common between df1 and df2 would work later i can clean the data.Thank You — Akash Meghani, Oct 07 '19 at 16:29
If you want common elements, see here: https://stackoverflow.com/questions/18079563/finding-the-intersection-between-two-series-in-pandas — Valentino, Oct 07 '19 at 17:01

score 1 · Answer 1 · answered Oct 07 '19 at 17:14

One possibility would be to use the get_close_matches from the difflib library.

import pandas as pd
import difflib

drug_description = ["sodium chloride 0.9% SOLN","Nimodipine 30 mg oral",
                    "Livothirine 20 mg oral", "Livo tab 112",
                    "Omega-3 Fatty Acids"]

df1 = pd.DataFrame({"Drug_Admin_Description":drug_description})


drug_name = ["Sodium chloride 0.5% SOLN", "omega-3 Fatty Acids",
            "gentamicin 40 mg/ml soln", "amoxilin 123", "abcd 12654"]

df2 = pd.DataFrame({"Drug_Name":drug_name})
# The above code is to create the dataframe with the information you provided



match_list = [] # We will append this list with the drug names that are similar to the drugs in Drug_Admin_description

for drug in df1["Drug_Admin_Description"]:
    match_test = difflib.get_close_matches(drug, drug_name, n=1)
    if len(match_test) == 0: #if the match is less then 60% similarity it will return a blank list
        pass
    else:
        match_list.append(match_test[0]) #we will take the only item in that list and append it to our match list

df3 = pd.DataFrame({"new_column":match_list}) #we will then make a dataframe of the matches.

Here is a link to the documentation below for get_close_matches. You can pass in the cutoff parameter to decide on what match % you want for each word. https://docs.python.org/2/library/difflib.html#difflib.get_close_matches

Thanks a lot @RamWill I am trying to implement it in avery slow server.Thank you for your help. — Akash Meghani, Oct 08 '19 at 14:24

Valdi_Bo · Answer 2 · 2019-10-07T17:37:34.993

One of possible solutions:

To get names from a column of a DataFrame, define the following function:

def getNames(src, colName):
    res = src.str.split(r' [\d.%]+ ?', n=1, expand=True).drop(1, 'columns')
    res.set_index(res[0].str.upper(), inplace=True)
    res.index.name = None
    res.columns = [colName]
    return res

I noticed that each drug name can contain a "numeric part" (a space, a sequence of digis, including a dot or a percent char).

So this function splits each name on this pattern and takes only the first "segment".

Then note that there are differences concerning upper / lower case, so each list of names must have the index containing the same name in upper case (so there will be possible to join both name lists just on the index).

Then call this function for both source columns:

n1 = getNames(df1.Drug_Admin_Description, 'Name')
n2 = getNames(df2.Drug_Name, 'Name2')

And to get the final result, run:

n1.join(n2, how='inner').drop('Name2', 'columns').reset_index(drop=True)

There is one diffrence compared to your desired result, namely Omega-3 Fatty Acids is the result in full text.

According to the criterion I chose, this name contains no numerical part. The only digit (3) is an integral part of the name and there are no digits after this place. So I think that there is nothing to "cut off" in this case.

Thnak You that was usefull,I am trying to implement it in very slow server and the data is pretty huge.I am still trying to implement it . — Akash Meghani, Oct 08 '19 at 14:22
Another approach worth considering is to use *fuzzy matching*. It involves *fuzzywuzzy* Python module. Even on StackOverflow you can find plenty of questions and examples. — Valdi_Bo, Oct 08 '19 at 16:17

How to compare two columns of diffrent dataframes and create a new one

2 Answers2