0

I am trying to perform string matching between two pandas dataframe.

df_1:
ID   Text           Text_ID
1    Apple            53
2    Banana           84
3    Torent File      77

df_2: 
ID   File_name      
22   apple_mask.txt
23   melon_banana.txt
24   Torrent.pdf
25   Abc.ppt

Objective: I want to populate the Text_ID against File_name in df_2 if the string in df_1['Text'] matches with df_2['File_name']. If no matches found then populate the df_2[Text_ID] as -1. So the resultant df` looks like

ID   Flie_name           Text_ID
22   apple_mask.txt        53
23   melon_banana.txt      84
24   Torrent.pdf           77          
25   Abc.ppt               -1

I have tried this SO thread, but it is giving a column where File_name wise fuzz score is listed.

I am trying out a non fuzzy way. Please see below the code snippets:

text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
for i,j in zip(text_ls,file_ls):
  if str(j) in str(i):
    t_i = df_1.loc[df_1['Text']==i,'Text_ID']
    text_id.append(t_i)
  else:
    t_i = -1
    text_id.append(t_i)
df_2['Text_ID'] = text_id

But I am getting a blank text_id list.

Can anybody provide some clue on this? I am OK to use fuzzywuzzy as well.

pythondumb
  • 1,187
  • 1
  • 15
  • 30

2 Answers2

1

You can get it with the following code:

df2['Text_ID'] = -1    # set -1 by default for all the file names
for _,file_name in df2.iterrows():
    for _, text in df1.iterrows():     
        if text[0].lower() in file_name[0]:  # compare strings
            df2.loc[df2.File_name == file_name[0],'Text_ID'] = text[1] # assaign the Text_ID from df1 in df2
            break

Keep in mind:

  • String comparison: As it is now working, apple and banana are contained in apple_mask.txt and melon_banana.txt, but torrent file is not in torrent.pdf. Consider redefining the string comparison.
  • df.iterrows() returns two values, the index of the row and the values of the row, in this case I have replaced the index by _ since it is not necessary to solve this problem

result:

df2
          File_name  Text_ID
0   apple_mask.text       53
1  melon_banana.txt       84
2       Torrent.pdf       -1
3           Abc.ppt       -1
fullfine
  • 1,371
  • 1
  • 4
  • 11
0

You can try following code:

text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
for i,j in zip(text_ls,file_ls):
      if j.lower().find(i.lower()) == -1:
        t_i = -1
        df_2.loc[df_2['File_name']==j,'Text_ID']=t_i
      else:
        t_i = df_1.loc[df_1['Text']==i,'Text_ID']
        df_2.loc[df_2['File_name']==j,'Text_ID']=t_i
SAL
  • 597
  • 3
  • 17
  • Not quite sure about this line `if j.find(i) == -1:`. Please note there is no entry in df_2 as -1. Also you seemed to have missed the fact that `df_1['Text']` and `df_2['File_name']` are not exactly same. – pythondumb Dec 07 '20 at 12:34
  • Refer following answer for further clarification. https://stackoverflow.com/a/27138045/3417134 – SAL Dec 07 '20 at 12:37
  • I have updated the code to convert df_1['Text'] and df_2['File_name'] to lower just for sake to comparison. – SAL Dec 07 '20 at 12:42
  • Your method might fail, if there are (1) blanks in df_2 (2) repeat of same file name in df_2. I just checked with my actual example and it is performing correct for the 1st match. – pythondumb Dec 07 '20 at 13:22