
I have a dataset of towns with their geographical data, plus some input data. The input is almost always a town name as well, but because these names were scraped off the internet, they can be slightly misspelled or spelled differently, e.g. Saint Petersburg <-> St. Petersburg.

During my research I came across a couple of algorithms and tried two of them. First I tried the Sørensen–Dice coefficient. This gave promising results until I tried to match short strings against longer ones. The algorithm works really well when all strings are roughly the same length, but when the lengths differ a lot the results are mixed. For example, when matching Saint against the set, it gives Sail as the best match, while I want Saint-Petersburg. The second algorithm I tried was the Levenshtein distance, but it didn't fare well for the same reason.
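To make the problem concrete, here is a minimal sketch of the Sørensen–Dice coefficient over character bigrams. It is not my exact code: the function names `bigrams` and `dice` are just illustrative, and I am assuming lowercase comparison and bigram multisets.

```python
from collections import Counter


def bigrams(s):
    """Return the multiset of adjacent character pairs in s (lowercased)."""
    s = s.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a, b):
    """Sørensen–Dice coefficient on character bigrams, in the range [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0  # both strings are too short to have any bigram
    overlap = sum((x & y).values())  # multiset intersection
    return 2 * overlap / total


if __name__ == "__main__":
    for town in ["Sail", "Saint-Petersburg"]:
        print(town, round(dice("Saint", town), 3))
```

With this sketch, Saint shares 2 of its bigrams with Sail (7 bigrams in total across both strings) and all 4 with Saint-Petersburg (19 bigrams in total), so Sail scores higher even though Saint-Petersburg contains the whole query. The long string is penalised for all its unmatched bigrams.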

I also came across other algorithms such as cosine similarity and longest common subsequence, but those seem a bit more complicated, and I would like to keep the cost of calculation down.

Are there any algorithms that prioritize length of match over the percentage matched?
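To illustrate the kind of ranking I mean, here is a rough sketch that scores each candidate by the absolute length of the longest contiguous run it shares with the query, rather than by a ratio. It uses `difflib.SequenceMatcher` from the Python standard library; the helper names `longest_match_len` and `best_match` are made up for this example, and I have not checked whether this holds up on real data or how it compares in cost.

```python
from difflib import SequenceMatcher


def longest_match_len(query, candidate):
    """Length of the longest contiguous block shared by both strings."""
    a, b = query.lower(), candidate.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def best_match(query, towns):
    """Pick the town with the longest shared block (first one wins on ties)."""
    return max(towns, key=lambda town: longest_match_len(query, town))


if __name__ == "__main__":
    print(best_match("Saint", ["Sail", "Saint-Petersburg"]))
```

Here Saint shares only "sai" (3 characters) with Sail but all 5 characters with Saint-Petersburg, so the longer match wins regardless of how much of the candidate is left unmatched. Ties would still need some secondary score.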

Does anyone have experience with matching oddly spelled town names? Please let me know!

EDIT:

I thought this SO question was a possible duplicate, but it turns out it describes Sørensen–Dice.

  • I think lack of accuracy is inherent in this kind of problem. There is no specific rule in the mismatch, so you can't implement a specific rule to match it. If the project is big enough, train an AI maybe..? – Vaibhav Garg Apr 05 '18 at 09:19
  • I wouldn't want to write rules for every town, that seems a bit inefficient? An AI would be able to do this, but sadly this is only a very small part of a project, so it isn't really doable. – Adriaan Vermeire Apr 05 '18 at 09:22
  • There is no algorithm that could possibly know that "saint" is closer to "st" than "sail". You have to treat names like that as special cases. – JJJ Apr 05 '18 at 09:25
  • Matching Saint to St isn't what I want though. That wouldn't be possible without extra rules, as you said. I want to prioritize longer matches above percentage of match. – Adriaan Vermeire Apr 05 '18 at 09:29
