I have a dataset of towns and their geographical data, plus some input data. This input is almost always a town as well, but since these towns are scraped off the internet, they can be slightly misspelled or spelled differently, e.g. Saint Petersburg <-> St. Petersburg.
During my research I came across a couple of algorithms and tried out two. First I tried the Sørensen–Dice coefficient. It gave me some promising results until I tried to match short strings against longer ones. The algorithm is really good when all strings are roughly the same length, but when they differ a lot in length you get mixed results, e.g. when matching `Saint` against the set, it gives `Sail` as the best match, while I want `Saint-Petersburg`.
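To make the failure concrete, here is the quick throwaway bigram-based Dice sketch I used to check the scores (my own code, not any particular library):

```python
def bigrams(s):
    """Set of character bigrams, case-folded."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Sørensen–Dice coefficient over bigram sets."""
    A, B = bigrams(a), bigrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

# The short candidate wins because the long name's many extra
# bigrams drag down the denominator:
print(dice("Saint", "Sail"))              # 4/7  ~ 0.571
print(dice("Saint", "Saint-Petersburg"))  # 8/19 ~ 0.421
```

All four bigrams of `Saint` appear in `Saint-Petersburg`, yet it still loses to `Sail`, which only shares two.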
The second algorithm I tried was the Levenshtein distance, but for the same reason it didn't fare well.
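Levenshtein has the same length bias, just in the other direction: every character the longer name adds is another insertion to pay for. A minimal dynamic-programming version (again my own sketch) shows it on the same example:

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Saint", "Sail"))              # 2
print(levenshtein("Saint", "Saint-Petersburg"))  # 11, despite the exact prefix
```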
I came across some other algorithms such as cosine similarity, longest common subsequence and more, but those seem a bit more complicated, and I would like to keep the cost of calculation down.
Are there any algorithms that prioritize length of match over the percentage matched?
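To clarify what I mean by "prioritizing length of match": something that, as I understand it, normalizes by the smaller bigram set instead of both, roughly like this sketch (I believe this is called the overlap coefficient, but I may be wrong about the name):

```python
def bigrams(s):
    """Set of character bigrams, case-folded."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def overlap(a, b):
    """Shared bigrams divided by the size of the smaller bigram set,
    so extra length in one string is not penalized."""
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / min(len(A), len(B))

print(overlap("Saint", "Sail"))              # 2/3 ~ 0.667
print(overlap("Saint", "Saint-Petersburg"))  # 1.0 -- the match I want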
Does anyone have experience with matching oddly spelled town names? Please let me know!
EDIT:
I thought this SO question was a possible duplicate, but it turns out it describes the Sørensen–Dice coefficient.