Problem:

Given a query string Q, find the best-fitting patterns from a collection of patterns.

Example 1:

Patterns = ["tomato", "tomato purée", "cayenne pepper", "pepper"]

Query = "I like tomato purée but I do not like pepper, but cayenne pepper is fine.

Desired Result = ["tomato purée", "pepper", "cayenne pepper"]

Example 2:

Patterns = ["crushed tomatoes", "tomatoes", "buckwheat", "flour", "buckwheat flour"]

Query = "Don't forget to buy some crushed tomatoes and buckwheat flour at the store!"

Desired Result = ["crushed tomatoes", "buckwheat flour"]

Example 3:

Patterns = ["crushed tomatoes" "tomatoes", "buckwheat", "flour", "buckwheat flour"]

Query = "Don't forget to buy some crushed tomats and buckwheet flour at the store!"

Desired Result = ["crushed tomatoes", "buckwheat flour"]

What I have tried:

I have tried building a trie out of my patterns and then tokenizing my query string at every space, so that I get all words individually, and then checking whether each word exists in the trie (a sketch of this follows the list below). There are two problems I am facing with this:

  1. I am not sure how to efficiently include fuzzy/approximate matching with the trie data structure.
  2. I am getting individual word matches but missing multi-word matches. Consider example 2: I only find "tomatoes", but the best-fitting match is "crushed tomatoes"; likewise I match "buckwheat" and "flour", but the appropriate match is "buckwheat flour".
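
For reference, here is a minimal sketch of that approach (a plain dict-based trie; all names are illustrative). It reproduces both problems: lookups are exact, and only single tokens can ever match:

```python
# Minimal sketch of the tokenize-and-look-up approach. END marks a
# node that completes a pattern; names are illustrative.
END = object()

def build_trie(patterns):
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[END] = pattern
    return root

def lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return None  # exact matching only: no fuzziness (problem 1)
        node = node[ch]
    return node.get(END)

patterns = ["crushed tomatoes", "tomatoes", "buckwheat", "flour", "buckwheat flour"]
query = "Don't forget to buy some crushed tomatoes and buckwheat flour at the store!"
trie = build_trie(patterns)
tokens = [t.strip(".,!?") for t in query.split()]
print([m for t in tokens if (m := lookup(trie, t)) is not None])
# -> ['tomatoes', 'buckwheat', 'flour']  (multi-word patterns are missed: problem 2)
```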

Note: Problem 2 is what I think primarily sets my question apart from other questions on Stack Overflow. A very similar question is: Algorithm for Fuzzy Matching Multiple Word Phrases in a Paragraph. However, the solution there recommends the Aho-Corasick algorithm, which does NOT do fuzzy/approximate matching! I basically need a fuzzy Aho-Corasick, if something like that exists.

I have thought about solving it by not tokenizing the query string, instead walking through the query character by character and resetting the trie-walker whenever a match is found (a sketch of this idea follows below). One problem I foresee is with fuzzy/approximate matching: the edit-distance budget gets "used up" more quickly if, for example, the query is "krushed tomattoes". The allowed edit distance would have to be higher for all words, since I assume I can't know, before a word is matched, whether or not it is part of a multi-word match. That reduces accuracy for all short words too. Perhaps this is of no concern, or is easily avoided or fixed - I don't know.
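
To make the idea concrete, here is a minimal sketch of one way it could work, assuming a dict-based trie as above and plain Levenshtein costs: at every word boundary, walk the trie while keeping one dynamic-programming row of edit distances per trie node against the query window starting there, pruning any branch whose entire row already exceeds the budget. The names and the flat `max_dist` budget are assumptions, not a finished design:

```python
END = object()

def build_trie(patterns):
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[END] = pattern
    return root

def fuzzy_from(trie, text, start, max_dist, max_len):
    """Patterns matching text starting at `start`, within `max_dist` edits.
    DP rows correspond to trie depth, columns to the query window;
    taking the min of the final row lets a match end anywhere."""
    window = text[start:start + max_len]
    hits = []

    def walk(node, prev_row):
        for ch, child in node.items():
            if ch is END:
                dist = min(prev_row)
                if dist <= max_dist:
                    hits.append((child, dist, start))
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(window) + 1):
                cost = 0 if window[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,          # insertion
                               prev_row[j] + 1,         # deletion
                               prev_row[j - 1] + cost)) # substitution/match
            if min(row) <= max_dist:                    # prune hopeless branches
                walk(child, row)

    walk(trie, list(range(len(window) + 1)))
    return hits

patterns = ["crushed tomatoes", "tomatoes", "buckwheat", "flour", "buckwheat flour"]
query = "Don't forget to buy some crushed tomats and buckwheet flour at the store!"
trie = build_trie(patterns)
max_len = max(len(p) for p in patterns) + 2  # longest pattern plus edit slack
starts = [0] + [i + 1 for i, c in enumerate(query) if c == " "]
hits = [h for s in starts
          for h in fuzzy_from(trie, query, s, max_dist=2, max_len=max_len)]
print(hits)  # includes 'crushed tomatoes' (dist 2) and 'buckwheat flour' (dist 1)
```

This also shows the concern in practice: the output still contains the overlapping shorter hits ("tomatoes", "buckwheat", "flour"), so preferring the longest/closest span is a needed post-processing step, and with a flat budget of 2 the short pattern "flour" even fuzzy-matches the "for" in "forget". The budget presumably wants to scale with pattern length.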

Any ideas on how to solve this problem efficiently? The querying needs to be fast, but the preprocessing for building the search index can be as slow as it needs to be.

  • I would solve this problem by finding [the Levenshtein distance to each substring](https://stackoverflow.com/questions/8139958/algorithm-to-find-edit-distance-to-all-substrings) that has the same length as the pattern string. The answer would be the pattern string with the closest matching substring. – Anderson Green Apr 22 '22 at 21:58
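
To make the comment's suggestion concrete, here is a minimal sketch of the classic approximate-substring DP (the variant often attributed to Sellers): initializing the first row to zeros lets a match start at any position, so the substring length does not even need to be fixed in advance. One would then keep patterns whose best distance is small relative to their length. It costs O(|pattern| × |query|) per pattern, so it scales linearly with the number of patterns:

```python
def best_substring_match(pattern, text):
    """Minimum edit distance between pattern and any substring of text,
    plus the end index of the best-matching substring. The first DP row
    is all zeros, so a match may start anywhere in text."""
    m, n = len(pattern), len(text)
    prev = [0] * (n + 1)      # empty pattern matches anywhere at cost 0
    for i in range(1, m + 1):
        cur = [i]             # matching pattern[:i] against the empty substring
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        prev = cur
    end = min(range(n + 1), key=lambda j: prev[j])
    return prev[end], end

patterns = ["crushed tomatoes", "tomatoes", "buckwheat", "flour", "buckwheat flour"]
query = "Don't forget to buy some crushed tomats and buckwheet flour at the store!"
for p in patterns:
    print(p, best_substring_match(p, query))
# "crushed tomatoes" scores distance 2, "buckwheat flour" distance 1
```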
