0

I have 2 dataframes across which I need to map the keywords. The input data(df1) looks like this:

    keyword            subtopic     
    post office        Brand        
    uspshelp uspshelp  Help         
    package delivery   Shipping     
    fed ex             Brand        
    ups fedex          Brand        
    delivery done      Shipping     
    united states      location     
    rt ups             retweet      

This is the other dataframe (df2) which is to be used for keyword matching:

Key     Media_type  cleaned_text
910040  facebook    will take post office
409535  twitter     need help with upshelp upshelp
218658  facebook    there no section post office alabama ups fedex
218658  facebook    there no section post office alabama ups fedex
518903  twitter     cant wait see exactly ups fedex truck package
2423281 twitter     fed ex messed seedless
763587  twitter     crazy package delivery rammed car
827572  twitter     formatting idead delivery done
2404106 facebook    supoused mexico united states america
1077739 twitter     rt ups

I want to map the 'keyword' column in df1 to the 'cleaned_text' column in df2 based on few conditions:

  1. One row in 'keyword' can be mapped to more than one row in 'cleaned_text' (One to many relationship)
  2. It should select the whole keyword together and not just individual words.
  3. If a 'keyword' matches to more than one row in 'cleaned_Text' it should create new records in the output dataframe(df3)

This is how the output dataframe(df3) should look like:

Key     Media_type  cleaned_text                                    keyword               subtopic  
910040  facebook    will take post office                           post office           Brand 
409535  twitter     need help with upshelp upshelp                  uspshelp uspshelp     Help  
218658  facebook    there no section post office alabama ups fedex  post office           Brand 
218658  facebook    there no section post office alabama ups fedex  ups fedex             Brand 
518903  twitter     cant wait see exactly ups fedex truck package   ups fedex             Brand 
2423281 twitter     fed ex messed seedless                          fed ex messed         Brand 
763587  twitter     crazy package delivery rammed car               package delivery      Shipping  
827572  twitter     formatting idead delivery done                  delivery done         Shipping  
2404106 facebook    supoused mexico united states america           united states america location  
1077739 twitter     rt ups                                          rt ups                retweet               
halfer
  • 19,824
  • 17
  • 99
  • 186
  • Does this answer your question? [Python Pandas - Merge based on substring in string](https://stackoverflow.com/questions/48743662/python-pandas-merge-based-on-substring-in-string) – G. Anderson Feb 11 '20 at 20:59
  • It is not able to map all the rows correctly. It does the job partially. @G.Anderson – Yashvardhan Rathi Feb 11 '20 at 22:48

1 Answers1

1

How about converting your df1 into a dictionary? And then loop through your df2 and search for matches. It is maybe not the most efficient way, but it is very readable

keyword_dict = {row.keyword: row.subtopic for row in df1.itertuples()}
df3_data = []
for row in df2.itertuples():
    text = row.cleaned_text
    for keyword in keyword_dict:
        if keyword in text:
            df3_row = [row.Key, row.Media_type, row.cleaned_text, keyword, keyword_dict[keyword]]
            df3_data.append(df3_row)

df3_columns = list(df2.columns) + list(df1.columns)
df3 = pd.DataFrame(df3_data, columns=df3_columns)
chatax
  • 990
  • 3
  • 17
  • It does the job but I have a couple of doubts: 1. Why is the row with key '218658' getting duplicated? It should only be present two times instead of four times. 2. In key '2404106' and '423281' the keyword column is showing only 2 keywords instead of 3 keywords. Thank you for the answer. Please help me with the doubts as well. – Yashvardhan Rathi Feb 11 '20 at 21:34
  • You shouldn't need to convert dataframes to dictionaries to do this. – m13op22 Feb 11 '20 at 21:54
  • Your keywords df1 only contains max. 2 words. So you can fix that by updating that dataframe. In my code there are only two occurances of key '218658', so I don't know why that is happening. – chatax Feb 11 '20 at 22:02
  • Yes, I was able to fix that by updating the dataframe..Although I am still getting duplicate rows for '218658'. Not sure why. Let me check with some more data and verify it. @chatax – Yashvardhan Rathi Feb 11 '20 at 22:13
  • @HS-nebula can you please share the answer without dictionaries? – Yashvardhan Rathi Feb 11 '20 at 22:14
  • What if I have more than 2 columns in df1? @chatax – Yashvardhan Rathi Feb 12 '20 at 15:56
  • Than you make sure that the keyword dictionary values are lists of this other row values. Than you concat the dicionary value with the df2 row. – chatax Feb 13 '20 at 14:49
  • I was able to find another way to do it. Thanks for your input. Can you upvote my question? @chatax – Yashvardhan Rathi Feb 13 '20 at 17:42
  • What is your solution? – chatax Feb 14 '20 at 10:36
  • I just used another dictionary. Actually I have one more doubt. There are some changes to the input file. I was thinking of posting a separate question for that. @chatax – Yashvardhan Rathi Feb 24 '20 at 19:37
  • I posted the doubt in another question [here](https://stackoverflow.com/questions/60383365/is-there-a-way-to-do-keyword-matching-in-pandas-and-get-a-dataframe) :v Please check if you can. @chatax – Yashvardhan Rathi Feb 24 '20 at 20:28