
I am looking for a way to find the closest string match between two strings that may have very different lengths. Let's say I have, on the one hand, a list of possible locations like:

Yosemite National Park

Yosemite Valley

Yosemite National Park Lodge

Yosemite National Park Visitor Center

San Francisco

Golden Gate Park San Francisco

Paris

New York

Manhattan New York

Hong Kong

On the other hand, I have multiple sentences like:

  1. "I proposed to my wife on the 12th of November 1984, during a crazy downpour in the middle of Yosemite in California"
  2. "I love to walk my dog in Central Park, New York"
  3. "I love Hong Kong"

Now say I would like to extract the location from this set of sentences: how would I proceed to do that? I know about the Levenshtein distance algorithm, but I'm not quite sure it will work efficiently here, especially because I have many more locations and many more sentences to try and match. I guess what I would love to have is a matching score of some sort for each location so that I can pick the one with the highest score, but I have no idea how to compute such a score.
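The only concrete thing I can think of is brute-force scoring along the lines of the sketch below (just to illustrate the kind of score I mean, using the fuzzywuzzy package purely as an example), which is exactly the kind of approach I'm afraid won't scale:

from fuzzywuzzy import fuzz

locations = ["Yosemite National Park", "Yosemite Valley", "San Francisco",
             "Golden Gate Park San Francisco", "Paris", "New York",
             "Manhattan New York", "Hong Kong"]

sentence = ("I proposed to my wife on the 12th of November 1984, during a crazy "
            "downpour in the middle of Yosemite in California")

# partial_ratio compares each location against the best-matching substring of
# the sentence, so the length difference matters less than with a plain ratio
scores = {loc: fuzz.partial_ratio(loc.lower(), sentence.lower()) for loc in locations}
best = max(scores, key=scores.get)
print(best, scores[best])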

Do you guys have any idea of how to do that? Or perhaps even an implementation or python package?

Thanks in advance

LBes
  • First of all, the Levenshtein distance algorithm computes the minimum edit distance between two strings. It seems that your problem is a search problem. Also, what do you want to do with the score assigned to a location? How does it help in extracting the exact location from a given sentence? Maybe you are not clarifying something in the question. – SomeDude Aug 24 '18 at 23:46
  • @SomeDude getting a possible exact location is exactly what I want to achieve for other purposes. Having a score was just a way to get the most likely location. Indeed it seems that my problem is more of a search problem, but I still couldn't find an algorithm that would help me with that – LBes Aug 25 '18 at 00:00
  • In all your examples a simple sentence.contains(place) is enough, unclear when you would need something more complex – juvian Aug 25 '18 at 00:04
  • @juvian Examples are updated as well as possible sentences – LBes Aug 25 '18 at 00:07
  • What you want is called named entity recognition (NER). It's a complex subject – juvian Aug 25 '18 at 00:21
  • @juvian NER seems more complicated than what I am trying to achieve, isn't it? I mean I already have the dataset with all possible locations within my "corpus". – LBes Aug 25 '18 at 00:34
  • Not an expert on the subject, hopefully someone can give you a better/simpler answer – juvian Aug 25 '18 at 01:12
  • maybe you find this one a bit [similar](https://stackoverflow.com/questions/52040353/speeding-up-a-closest-string-match-algorithm) to yours – juvian Aug 27 '18 at 14:09
  • @juvian that's me asking the question ;) – LBes Aug 27 '18 at 14:10
  • Oh, didn't check that, I see :). As it had ended up using NER I thought that it wasn't yours ^^ – juvian Aug 27 '18 at 14:12

2 Answers


You may want to look at the Aho-Corasick algorithm, from the Wikipedia:

In computer science, the Aho–Corasick algorithm is a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all strings simultaneously. The complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches.

In your example, the dictionary of strings is the list of places and the input texts are the sentences. There are several implementations in multiple languages; I recommend flashtext (Python). Here is an example:

from flashtext import KeywordProcessor

keywords = ['Yosemite',
            'Yosemite National Park',
            'Yosemite Valley',
            'Yosemite National Park Lodge',
            'Yosemite National Park Visitor Center',
            'San Francisco',
            'Golden Gate Park San Francisco',
            'Paris',
            'New York',
            'Manhattan New York',
            'Hong Kong']

keyword_processor = KeywordProcessor(case_sensitive=False)
for keyword in keywords:
    keyword_processor.add_keyword(keyword)

sentences = ["I proposed to my wife on the 12th of November 1984, during a crazy downpour in the middle of Yosemite in California",
"I love to walk my dog in Central Park, New York",
"I love Hong Kong"]

for sentence in sentences:
    extracted = keyword_processor.extract_keywords(sentence)
    print(extracted)

Output

['Yosemite']
['New York']
['Hong Kong']
Dani Mesejo
  • While this works very well with your added keyword of Yosemite, it doesn't work if I remove it, and I get no match for my first sentence. I don't have exact matches in my strings, which is one of my problems – LBes Aug 26 '18 at 16:58
  • Still upvoted though as it may answer some problems for future users that are similar to mine but a bit simpler :) – LBes Aug 26 '18 at 17:04
  • I see, does "Yose" or "Mite" produce a match for the first sentence, or are only whole-word partial matches valid? – Dani Mesejo Aug 26 '18 at 17:22
  • I suppose whole word would be better but I haven't checked my whole dataset to see if it only contains whole words – LBes Aug 26 '18 at 17:26

For jobs like this, you'd typically use a processing pipeline along these general lines (a rough sketch follows the list):

  1. remove "noise" words (aka "stop words") like "a", "an", "the", "is", and so on. If you look around a bit, you can find various lists of stop words to filter out.
  2. create a vector space model of each "document" in your corpus.
  3. Create a vector space model of a query.
  4. compute something like the TF-IDF or cosine distance between a query vector and each candidate document vector.
  5. Choose the highest score as representing the most likely match.
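
Here is a rough sketch of that pipeline, assuming scikit-learn is available (the location list plays the role of the "documents" and each sentence is a "query"; fitting the vectorizer on the locations, with English stop words removed, is just one possible setup, not necessarily the best one):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

locations = ["Yosemite National Park", "Yosemite Valley",
             "Yosemite National Park Lodge", "Yosemite National Park Visitor Center",
             "San Francisco", "Golden Gate Park San Francisco",
             "Paris", "New York", "Manhattan New York", "Hong Kong"]

sentences = ["I proposed to my wife on the 12th of November 1984, during a crazy downpour in the middle of Yosemite in California",
             "I love to walk my dog in Central Park, New York",
             "I love Hong Kong"]

# Steps 1-2: TF-IDF vectors for the candidate "documents", with stop words removed
vectorizer = TfidfVectorizer(stop_words="english")
location_vectors = vectorizer.fit_transform(locations)

for sentence in sentences:
    # Step 3: vectorize the query with the same vocabulary
    query_vector = vectorizer.transform([sentence])
    # Step 4: cosine similarity against every candidate
    scores = cosine_similarity(query_vector, location_vectors)[0]
    # Step 5: pick the highest-scoring candidate
    best = scores.argmax()
    print(locations[best], round(float(scores[best]), 3))

Because the query is vectorized with the vocabulary learned from the locations, words that never appear in any location are simply ignored; whether that's acceptable depends on your data.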

References

  1. Onix stop word list
  2. NLTK stop word list
  3. vector space model
  4. cosine similarity
  5. tf-idf
  6. tf-idf vs. cosine similarity

I should probably add that this sort of pipeline is more often used when you have a much larger number of documents, and each document individually is considerably larger as well. Since the "documents" and "queries" are represented exactly the same way, it's also useful/usable for cases where you want to categorize and group documents--that is, find how similar documents are to each other.

Jerry Coffin
  • I thought about doing something like that, i.e., first removing stop words, but figured it might not be enough and that I could instead probably use https://nlp.stanford.edu/software/CRF-NER.shtml to try and get possible locations right away. I am not sure which one is more efficient, but I think the NER path is more likely to remove all unnecessary words, don't you think? Also, regarding size: I have a 500MB file of possible locations and a 2GB file of data, so I do have some data available. – LBes Aug 26 '18 at 17:03
  • @LBes: Given its complexity (looks quite a bit higher to me) I'd certainly hope it worked better--but I haven't done enough testing with it to be sure how it really compares. – Jerry Coffin Aug 26 '18 at 17:27
  • Yes, the complexity is indeed higher, but the fact is that I sometimes have such long sentences, and so many of them, that extracting only a possible location that is as short as possible will definitely help. Of course I can't be sure that I am not going to miss out on some important bit of information by applying this algorithm though. In any case, I upvoted your answer as I think, while it might not be optimal for my case, it might solve other users' possible issues in the future – LBes Aug 26 '18 at 17:30
  • Going to accept the answer anyways, I think it's the closest to a possible answer. – LBes Aug 27 '18 at 13:02