Steps of the Algorithm
token_set_ratio performs the following steps:
- split each sentence into words and remove duplicates
- create three lists:
  - remainder1 = words that are only in the first sentence
  - remainder2 = words that are only in the second sentence
  - intersection = words that are in both sentences
- sort the words in each of the three lists and join them into one string per list:
  - sorted_remainder1
  - sorted_remainder2
  - sorted_intersection
- join the strings in the following way:
  - combined1 = <sorted_intersection><sorted_remainder1>
  - combined2 = <sorted_intersection><sorted_remainder2>
- calculate the following similarities:
  - fuzz.ratio(sorted_intersection, combined1)
  - fuzz.ratio(sorted_intersection, combined2)
  - fuzz.ratio(combined1, combined2)
- return the maximum of those similarities
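The steps above can be sketched in plain Python. This is only an illustration, not the library's actual implementation: indel_ratio is a hypothetical stand-in for fuzz.ratio, assuming it behaves like a normalized Indel similarity (2 * LCS length / combined length, scaled to 0..100), and the sketch assumes the intersection and remainder strings are joined with a single space. The real function may also handle edge cases that are ignored here.

```python
def indel_ratio(a: str, b: str) -> float:
    """Hypothetical stand-in for fuzz.ratio: 100 * 2 * LCS(a, b) / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    # Dynamic-programming computation of the longest common subsequence length.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return 100.0 * 2 * prev[-1] / total


def token_set_ratio_sketch(s1: str, s2: str) -> float:
    # Split into words and remove duplicates.
    tokens1, tokens2 = set(s1.split()), set(s2.split())

    # Sort each of the three word lists and join it into one string.
    sorted_intersection = " ".join(sorted(tokens1 & tokens2))
    sorted_remainder1 = " ".join(sorted(tokens1 - tokens2))
    sorted_remainder2 = " ".join(sorted(tokens2 - tokens1))

    # Prefix each remainder with the shared words (assumed separator: one space).
    combined1 = f"{sorted_intersection} {sorted_remainder1}".strip()
    combined2 = f"{sorted_intersection} {sorted_remainder2}".strip()

    # Return the best of the three pairwise similarities.
    return max(
        indel_ratio(sorted_intersection, combined1),
        indel_ratio(sorted_intersection, combined2),
        indel_ratio(combined1, combined2),
    )


score = token_set_ratio_sketch("user attempts login", "acceptance criteria")
print(round(score))
```

For the two strings from the question this prints 42, since the sentences share no words and only the combined1 vs. combined2 comparison contributes.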
Example
For the strings "user attempts login" and "acceptance criteria", this leads to the following result:
remainder1 = ['user', 'attempts', 'login']
remainder2 = ['acceptance', 'criteria']
intersection = []
sorted_remainder1 = 'attempts login user'
sorted_remainder2 = 'acceptance criteria'
combined1 = 'attempts login user'
combined2 = 'acceptance criteria'
fuzz.ratio(sorted_intersection, combined1) = 0
fuzz.ratio(sorted_intersection, combined2) = 0
fuzz.ratio(combined1, combined2) = 42
In your specific case this gives a result similar to fuzz.token_sort_ratio, which only sorts the words in both sentences and compares them using fuzz.ratio.
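To see why the two scorers end up close here, the following sketch compares the strings each of them would pass to fuzz.ratio. Both helper functions are hypothetical illustrations of the preprocessing described in this answer, not library functions, and they assume words are joined with a single space.

```python
def token_sort_inputs(s1: str, s2: str) -> tuple[str, str]:
    # token_sort_ratio: sort the words of each sentence, keep duplicates.
    return " ".join(sorted(s1.split())), " ".join(sorted(s2.split()))


def token_set_combined(s1: str, s2: str) -> tuple[str, str]:
    # token_set_ratio: shared words first, then the words unique to each sentence.
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    shared = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    return f"{shared} {only1}".strip(), f"{shared} {only2}".strip()


# No shared words: both scorers compare exactly the same pair of strings.
a, b = "user attempts login", "acceptance criteria"
print(token_sort_inputs(a, b) == token_set_combined(a, b))

# Shared words: the compared strings differ, so the scores can differ too.
c, d = "user login", "login user now"
print(token_sort_inputs(c, d))
print(token_set_combined(c, d))
```

When the sentences have no words in common and no repeated words, the intersection is empty and combined1 and combined2 are just the sorted sentences, which is what token_sort_ratio compares as well.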