string matching - best distance algorithm to use

Question

I have two dataframes, df1 and df2, that have information about polling stations. The dataframes are of different lengths. Both dataframes have a column called ps_name, which is the name of the polling stations, and a column called district that indicates which district the polling stations are located.

I am trying to match strings on the ps_name column while blocking on the district column, so I can copy a geolocations (latitude and longitude) column on matches from df1 to df2.

So far I've tried using jaro-winkler at threshold 0.88 to compare strings.

# Matched:
**df1:** AGRICULTURAL OFFICE ATTOCK (MALE) I (P)
**df2:** AGRICULTURAL OFFICE ATTOCK (MALE) (P)

# Did not match:
**df1:** govt girls high school peoples colony attock ii
**df2:** high school peoples colony attock ii

What string distance algorithm should I be using? I've tried jaro-winkler and was also considering smith-waterman.

I have another question: one of the pairs of polling stations that are getting matched while they shouldn't be matched have the following format: 'government high school X', 'government high school Y'. They are being matched because X and Y are short words. I was thinking of separating the string out into 'government high school' and 'X', 'Y' to match 'X' and 'Y' instead of the entire strings, but I wonder how this could work for the rest of the pairs that don't have this issue. — dmswjd, Jul 22 '20 at 15:26

score 0 · Answer 1 · answered Jul 22 '20 at 03:40

0

One option is to use Levenshtein distance which is implemented in the package fuzzywuzzy (or here), the algorithm runs in O(n + d^2), where n is the length of the longer string and d is the edit distance.

Example:

from fuzzywuzzy import fuzz
fuzz.ratio('govt girls high school peoples colony attock ii','high school peoples colony attock ii') 
#87
fuzz.ratio('AGRICULTURAL OFFICE ATTOCK (MALE) I (P)', 'AGRICULTURAL OFFICE ATTOCK (MALE) (P)')
#97

answered Jul 22 '20 at 03:40

ExplodingGayFish

2,807
1
5
14

how do you come to the result, that the Levenshtein distance can be calculated in O(n + d^2)? Shouldn't it be O(N*M), since it has to iterate over both strings in a nested loop? – maxbachmann Jul 22 '20 at 14:31
1

@maxbachmann see here https://stackoverflow.com/questions/4057513/levenshtein-distance-algorithm-better-than-onm – ExplodingGayFish Jul 23 '20 at 14:51

string matching - best distance algorithm to use

1 Answers1