5

First, I'm not looking for the actual fuzzy matching algorithm. We're using both Dice's Coefficient and Levenshtein Distance. I'm looking for the cleverest way to utilize these algorithms.

The Goal:

I'm trying to detect the names of cities in paragraph of text, in the order they occur. We have a list of ~1 million location names. I want to search through a paragraph of text, and detect when one of these locations are present, then store that city. Location names can be a single or multiple words.

Example Paragraph:

Hi Mom! Sam and I are thinking of road tripping through Canada in the next month. We know we can already stay at John's house in Quebec City. I know you have traveled a lot in Canada, so I wanted to get your advice.

Like I said, we'd start in Quebec city, then probably drive to Miramichi before heading to Halifax. After 2 days we want to go to Cape Breton. Finally, we want to check out Advocate Harbor to see things like the Bay of Fundy, Digby, and the Pier of St. Elizabeth

Talk to you soon!

Expected Results

  • Canada
  • Quebec City
  • Canada
  • Miramichi
  • Halifax
  • Cape Breton
  • Advocate Harbor
  • Bay of Fundy
  • Digby
  • Pier of St. Elizabeth

The Problem

My current roadblock is how to detect location names with multiple words. I know I can split the paragraph into words, then compare them against my list, like:

  1. Fuzzy match the first word against my list of location names
  2. If no match, fuzzy match (first word + second word) against my list of location names
  3. If no match, fuzzy match (first + second + third word) against my list of location names
  4. ...etc

This is my current approach, but it is incredibly slow and inefficient. Is there a clever way I can accomplish what I'm looking for?

CHawk
  • 1,346
  • 5
  • 22
  • 40
  • 1
    Can the paragraph be treat like a single line of string, and use some kind of string matching algorithm? such as https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm to match multiple pattern (locations in your case) – shole Jun 23 '16 at 01:45
  • Yes, this is exactly what I was looking for. It doesn't do Fuzzy matching but worked perfectly. Submit this as an answer and I'll mark it as correct. – CHawk Jun 23 '16 at 05:59
  • Thanks. Glad to know that it helps :) – shole Jun 23 '16 at 06:03

1 Answers1

2

I think some string matching algorithm works perfectly well for you,

Here is a list for them: String Matching Algorithms

In your case, I think you need multiple pattern string matching one, such as Aho–Corasick algorithm

shole
  • 4,046
  • 2
  • 29
  • 69
  • 1
    This works great! As a reference for others, I ended up using the Aho-Corasick implementation from this gem: https://github.com/ahnick/ahocorasick – CHawk Jun 23 '16 at 06:07