Check if a text string in one excel file contains in another excel file

Question

I've just started to learn Python , and i have two excel files which have different shape the first with 225 rows and the second with 500 . the task will be to compare a text from a specific column (Num 3) in file1 and compare it with the column (Num 3) also in file2, and if there's a match then show the highest percentage of the matching if there's no match show "No match"

Can any one give me an advise about that?

Example

Welcome to stackoverflow. When asking question about data in combination with `pandas`, it is advised to add your data in a way so we can copy and paste it ourself so we can reproduce an answer for you. Read more [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) on how to write a good pandas question. Furthermore, try not to post [pictures](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question) of data or code. — Erfan, Jul 25 '19 at 10:18
read the excel in two separate dataframes, merge the two dataframes on column(num 3) with join = inner. then you'll be having two columns for num3 ,i.e num3_x and num3_y , form another column where you calculate the leveshtein or fuzzywuzzy ratio and keep those columns above certain threshold and add 'no matxh for' otherwise. — Madhur Yadav, Jul 25 '19 at 10:24
Thanks ! the "inner" and "merge" method really helped me , but i still don't know how to calculate the percentage @MadhurYadav — Yasoo, Jul 25 '19 at 13:20
Install fuzzy-wuzzy python library. from fuzzy-wuzzy import fuzz. fuzz.ratio("mango","manga") will give you a score of 95 out of 100 in similarity matching....keep a threshold of say 75-80 , if the match is below 80 then no match, otherwise match — Madhur Yadav, Jul 25 '19 at 15:05
please type in the code that you have written as an answer !! it will help others for reference !! — Madhur Yadav, Jul 26 '19 at 12:47

Yasoo · Answer 1 · 2019-07-31T11:05:46.720

 import pandas as pd 
 import numpy as np
 from fuzzywuzzy import fuzz,process

 def match(x, y,  min_score=0):
    # -1 in case we don't get any match    
    max_score= -1
    max_text = ''
    for row2 in y:
    #finding fuzzy match score
    score = fuzz.ratio(x, row2)

    #checking if we are above our threshold and have a better score
    if (score > min_score) & (score > max_score):
        max_score = score
        max_text = row2

  return (max_score, max_text)

    #read the files
    pd.options.display.max_columns = 10

    #read only the 3rd column form both excel files
    wb1 = pd.read_excel('Excel1.xlsx', 'Sheet_name', na_values=['NA'], usecols =               [2])
    wb2 = pd.read_excel('Excel2.xlsx', 'Sheet_name', na_values=['NA'], usecols = [2])


    diff = pd.concat((wb1, wb2),  axis = 1)

    #add a new column to the DataFrame called "match"
    diff['match'] = np.zeros((len(diff), ))




  for i, row in enumerate(wb1['col_name']):
      score, text = match(row, wb2['col2_name'])
      print(score)
      diff.iloc[i, 1] = text
      diff.iloc[i, 2] = score


    diff.to_excel("output.xlsx")

Check if a text string in one excel file contains in another excel file

1 Answers1