0

I've just started to learn Python , and i have two excel files which have different shape the first with 225 rows and the second with 500 . the task will be to compare a text from a specific column (Num 3) in file1 and compare it with the column (Num 3) also in file2, and if there's a match then show the highest percentage of the matching if there's no match show "No match"

Can any one give me an advise about that?

Example

Yasoo
  • 1
  • 2
  • 2
    Welcome to stackoverflow. When asking question about data in combination with `pandas`, it is advised to add your data in a way so we can copy and paste it ourself so we can reproduce an answer for you. Read more [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) on how to write a good pandas question. Furthermore, try not to post [pictures](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question) of data or code. – Erfan Jul 25 '19 at 10:18
  • read the excel in two separate dataframes, merge the two dataframes on column(num 3) with join = inner. then you'll be having two columns for num3 ,i.e num3_x and num3_y , form another column where you calculate the leveshtein or fuzzywuzzy ratio and keep those columns above certain threshold and add 'no matxh for' otherwise. – Madhur Yadav Jul 25 '19 at 10:24
  • Thanks ! the "inner" and "merge" method really helped me , but i still don't know how to calculate the percentage @MadhurYadav – Yasoo Jul 25 '19 at 13:20
  • Install fuzzy-wuzzy python library. from fuzzy-wuzzy import fuzz. fuzz.ratio("mango","manga") will give you a score of 95 out of 100 in similarity matching....keep a threshold of say 75-80 , if the match is below 80 then no match, otherwise match – Madhur Yadav Jul 25 '19 at 15:05
  • And if you want exact match then keep the threshold to 100. – Madhur Yadav Jul 25 '19 at 15:05
  • U saved me , really thanks @MadhurYadav – Yasoo Jul 26 '19 at 10:58
  • please type in the code that you have written as an answer !! it will help others for reference !! – Madhur Yadav Jul 26 '19 at 12:47

1 Answers1

0
 import pandas as pd 
 import numpy as np
 from fuzzywuzzy import fuzz,process

 def match(x, y,  min_score=0):
    # -1 in case we don't get any match    
    max_score= -1
    max_text = ''
    for row2 in y:
    #finding fuzzy match score
    score = fuzz.ratio(x, row2)

    #checking if we are above our threshold and have a better score
    if (score > min_score) & (score > max_score):
        max_score = score
        max_text = row2

  return (max_score, max_text)

    #read the files
    pd.options.display.max_columns = 10

    #read only the 3rd column form both excel files
    wb1 = pd.read_excel('Excel1.xlsx', 'Sheet_name', na_values=['NA'], usecols =               [2])
    wb2 = pd.read_excel('Excel2.xlsx', 'Sheet_name', na_values=['NA'], usecols = [2])


    diff = pd.concat((wb1, wb2),  axis = 1)

    #add a new column to the DataFrame called "match"
    diff['match'] = np.zeros((len(diff), ))




  for i, row in enumerate(wb1['col_name']):
      score, text = match(row, wb2['col2_name'])
      print(score)
      diff.iloc[i, 1] = text
      diff.iloc[i, 2] = score


    diff.to_excel("output.xlsx")
Yasoo
  • 1
  • 2