First, I'm not looking for the actual fuzzy matching algorithm. We're using both Dice's Coefficient and Levenshtein Distance. I'm looking for the cleverest way to utilize these algorithms.
The Goal:
I'm trying to detect the names of cities in paragraph of text, in the order they occur. We have a list of ~1 million location names. I want to search through a paragraph of text, and detect when one of these locations are present, then store that city. Location names can be a single or multiple words.
Example Paragraph:
Hi Mom! Sam and I are thinking of road tripping through Canada in the next month. We know we can already stay at John's house in Quebec City. I know you have traveled a lot in Canada, so I wanted to get your advice.
Like I said, we'd start in Quebec city, then probably drive to Miramichi before heading to Halifax. After 2 days we want to go to Cape Breton. Finally, we want to check out Advocate Harbor to see things like the Bay of Fundy, Digby, and the Pier of St. Elizabeth
Talk to you soon!
Expected Results
- Canada
- Quebec City
- Canada
- Miramichi
- Halifax
- Cape Breton
- Advocate Harbor
- Bay of Fundy
- Digby
- Pier of St. Elizabeth
The Problem
My current roadblock is how to detect location names with multiple words. I know I can split the paragraph into words, then compare them against my list, like:
- Fuzzy match the first word against my list of location names
- If no match, fuzzy match (first word + second word) against my list of location names
- If no match, fuzzy match (first + second + third word) against my list of location names
- ...etc
This is my current approach, but it is incredibly slow and inefficient. Is there a clever way I can accomplish what I'm looking for?