Most efficient way to compare two panda data frame and update one dataframe based on condition

Question

I have two dataframe df1 and df2. df2 consist of "tagname" and "value" column. Dictionary "bucket_dict" holds the data from df2.

bucket_dict = dict(zip(df2.tagname,df2.value))

In a df1 there are millions of row.3 columns are there "apptag","comments" and "Type" in df1. I want to match between this two dataframes like, if

"dictionary key" from bucket_dict contains in df1["apptag"] then update the value of df1["comments"] = corresponding dictionary key and df1["Type"] = corresponding bucket_dict["key name"] . I used below code:

for each_tag in bucket_dict: 
    df1.loc[(df1["apptag"].str.match(each_tag, case = False ,na = False)), "comments"] =  each_tag
    df1.loc[(df1["apptag"].str.match(each_tag, case = False ,na = False)), "Type"] =  bucket_dict[each_tag]

Is there any efficient way to do this since it's taking longer time.

Bucketing df from which dictionary has been created:

bucketing_df = pd.DataFrame([["pen", "study"], ["pencil", "study"], ["ersr","study"],["rice","grocery"],["wht","grocery"]], columns=['tagname', 'value'])

other dataframe:

  output_df = pd.DataFrame([["test123-pen", "pen"," "], ["test234-pencil", "pencil"," "], ["test234-rice","rice", " "], columns=['apptag', 'comments','type'])

Required output:

I provided sample dataframe and expected output in the question @sammywemmy. Requested you to check the question once again — greenking, Feb 03 '20 at 08:14
you can use this a guide : https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — sammywemmy, Feb 03 '20 at 08:48
updated once again so that you can directly work with the data @sammywemmy — greenking, Feb 03 '20 at 09:13

Karan Shishoo · Accepted Answer · 2020-02-03T12:38:01.207

0

You can do this by calling an apply on your comments column along with a loc on your bucketing_df in this manner -

def find_type(a):
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['value'].values[0]
    except:
        return ""

def find_comments(a):
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['tagname'].values[0]
    except:
        return ""


output_df['type'] = output_df['apptag'].apply(lambda a: find_type(a))
output_df['comments'] = output_df['apptag'].apply(lambda a:find_comments(a))

Here I had to make them separate functions so it could handle cases where no tagname existed in apptag

It gives you this as the output_df -

           apptag comments     type
0     test123-pen      pen    study
1  test234-pencil   pencil    study
2    test234-rice     rice  grocery

All this code uses is the existing bucketing_df and output_df you provided at the end of your question.

edited Feb 03 '20 at 12:38

answered Feb 03 '20 at 09:58

Karan Shishoo

2,402
2
17
32

as per the question, I want to update based certain condition(substring match) – greenking Feb 03 '20 at 10:47
@greenking your question could do with some editing to make it a lot more clear. Is the format of the `apptag` always the same ? (does it always end in "-comment") – Karan Shishoo Feb 03 '20 at 10:52
Not at all @Karana. apptag can be anything no proper format at all. if "tagname" from the bucketing dataframe contains in the "apptag" then only update "comments" with dictionary key and type with corresponding dictionary value – greenking Feb 03 '20 at 10:58
@greenking added an edit with additional functions to account for cases where no `tagname` existed in `apptag` – Karan Shishoo Feb 03 '20 at 11:08
Thanks it worked but there is no as such performance improvement. – greenking Feb 03 '20 at 11:13
@greenking I found an issue. Due to the way its setup any apptags that "pencil" get flagged as "pens" as pen is a substring in pencil – Karan Shishoo Feb 03 '20 at 11:22
could please tell me one thing @Karan like in the function find_type where the below check is happening? If "tagname" value from bucketing df contains in the "apptag" column? Say for example if "pen" contains "test123-pen" then only call the function to update – greenking Feb 03 '20 at 11:43
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/207101/discussion-between-karan-shishoo-and-greenking). – Karan Shishoo Feb 03 '20 at 12:24

Most efficient way to compare two panda data frame and update one dataframe based on condition

1 Answers1