
So the problem is this: I have written a script that compares values in a DataFrame using fuzzywuzzy.

def check_match_principal_name(state):
    for i in range(len(ALL_SCHOOLS['Principal Name'])):
        for a in range(len(TOP100['Principal'])):
            matchADD = fuzz.token_sort_ratio(ALL_SCHOOLS['Principal Name'][i], TOP100['Principal'][a])
            if matchADD > 90:
                print(ALL_SCHOOLS['Principal Name'][i]+' '+TOP100['Principal'][a])
                matchPRI.append(i)
                matchPRI100.append(a)
                print(ALL_SCHOOLS['Principal Name'][i])
                print(TOP100['Principal'][a])
    for i in matchPRI:
        ALL_SCHOOLS.loc[i, 'MatchPRI'] = 1

    for i in matchPRI100:
        TOP100.loc[i, 'MatchPRI'] = 1

    ALL_SCHOOLS.to_excel(f'/Users/Giova/PycharmProjects/Schools/Final_final/{state}1.xlsx')
    TOP100.to_excel(f'/Users/Giova/PycharmProjects/Schools/Final_final/top-100/{state}1.xlsx')
    matchPRI.clear()
    matchPRI100.clear()

It works and raises no exceptions. However, in the script above, fuzz.token_sort_ratio(ALL_SCHOOLS['Principal Name'][i], TOP100['Principal'][a]) returns 91 for 'Kimberly Beukema' and 'Ms. Kimberly Beukema',

while a second script like this:

from fuzzywuzzy import fuzz
match= fuzz.partial_token_sort_ratio('Kimberly Beukema','  Ms. Kimberly Beukema')
print(match)

returns 100.

I don't understand why the value changes.

Cristina
    `token_sort_ratio` and `partial_token_sort_ratio` are two different functions. The latter one matches on the shortest string so has a 100% match. You can read up on it [here](https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings). – RJ Adriaansen Feb 09 '21 at 13:46

1 Answer

Both token_sort_ratio and partial_token_sort_ratio preprocess the two strings by default: they lowercase the strings, remove non-alphanumeric characters, and trim whitespace. So in your case they convert:

'Kimberly Beukema'
'  Ms. Kimberly Beukema'

to

'kimberly beukema'
'ms kimberly beukema'

In the next step they both sort the words in the two strings:

'beukema kimberly'
'beukema kimberly ms'
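
The two steps above can be reproduced with the standard library alone. This is only a sketch of the idea, not fuzzywuzzy's own code: `preprocess` and `sort_tokens` are hypothetical helper names, and the regex assumes ASCII input for simplicity.

```python
import re


def preprocess(s: str) -> str:
    """Lowercase, turn runs of non-alphanumeric characters into one space, trim."""
    # Assumption: ASCII-only input; this is a simplified stand-in for the
    # library's default processing.
    return re.sub(r"[^0-9a-z]+", " ", s.lower()).strip()


def sort_tokens(s: str) -> str:
    """Preprocess, split on whitespace, sort the words, and join them again."""
    return " ".join(sorted(preprocess(s).split()))


print(sort_tokens("Kimberly Beukema"))
print(sort_tokens("  Ms. Kimberly Beukema"))
```

For the two names in the question this prints the sorted strings shown above.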

Afterwards they compare the two strings. For this comparison, token_sort_ratio uses ratio, while partial_token_sort_ratio uses partial_ratio.

In ratio, 3 deletions are required to convert 'beukema kimberly ms' to 'beukema kimberly'. Since the strings have a combined length of 35, the resulting ratio is round(100 * (1 - 3 / 35)) = 91.
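
To check that arithmetic by hand, here is a small sketch that counts the insertions and deletions through the longest common subsequence and then applies the formula above. The helper names `lcs_length` and `indel_ratio` are made up for this illustration; the library itself may compute the value differently internally.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> int:
    """round(100 * (1 - edits / combined_length)), with edits = insertions + deletions."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    edits = total - 2 * lcs_length(a, b)
    return round(100 * (1 - edits / total))


print(indel_ratio("beukema kimberly", "beukema kimberly ms"))  # 91
```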

In partial_ratio, the ratio of the optimal alignment of the two strings is calculated. In your case, 'beukema kimberly' is a substring of 'beukema kimberly ms', so the ratio between 'beukema kimberly' and 'beukema kimberly' is calculated, which is round(100 * (1 - 0 / 32)) = 100.
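
The difference between the two comparisons can be illustrated with difflib from the standard library. This is a deliberately simplified sketch: `simple_partial_ratio` is a hypothetical helper that slides the shorter string over every same-length window of the longer one, which is not necessarily how the library searches for the best alignment.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of the full strings, scaled to 0..100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a: str, b: str) -> int:
    """Best similarity between the shorter string and any same-length window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    width = len(shorter)
    best = 0.0
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


print(simple_ratio("beukema kimberly", "beukema kimberly ms"))          # 91
print(simple_partial_ratio("beukema kimberly", "beukema kimberly ms"))  # 100
```

So if the extra "Ms." should not lower the score, the partial variant is the one to call inside the loop; if it should, keep token_sort_ratio.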

maxbachmann