0

I'm writing a tool to index a document. I have a long list of hundreds, perhaps thousands of fixed patterns. E.g. my index might look like {"cat training":"p.27", "cat handling":"p.29", "cat":"p.15", "dog training":"p.62", "dog":"p.60"} and so on.

Now I want to search my text (for the sake of argument, each paragraph is a single string) for all instances of any substring in my index. (During the search, I'll sort my keys by length, as shown, so that "cat training" would match before "cat").

One more complication is that I want the matches to occur on word boundaries. I.e. I don't want "catch" to match "cat".

Is there a pythonic way to do this? My current solution is to scan through the source string, word-by-word, and then try to match the start of the string against my entire index. It works, but it's pretty slow.

Edward Falk
  • 9,991
  • 11
  • 77
  • 112
  • this looks like a feature implemented in ide's, giving suggestions from text from long existing ccode , u need little hack on that part – sahasrara62 Oct 25 '19 at 22:42
  • 1
    Similar problem: [Matching against a large number of patterns](https://stackoverflow.com/questions/7049894/how-to-efficiently-match-an-input-string-against-several-regular-expressions-at). Code: [Aho–Corasick Module](https://pypi.org/project/pyahocorasick/). For [modifying for word boundaries](https://stackoverflow.com/questions/14444738/aho-corasick-text-matching-on-whole-words) – DarrylG Oct 25 '19 at 22:53
  • @DarrylG, I won't pretend to really understand the algorithm, but it can be modified to do what I want. Make your comment into an answer, and I'll mark it as answered. – Edward Falk Oct 27 '19 at 18:53
  • @EdwardFalk--glad it was helpful. I made my comments into an answer. – DarrylG Oct 27 '19 at 20:22
  • 1
    @EdwardFalk--ran into a [simpler approach](https://stackoverflow.com/questions/58625068/reduce-states-to-abbreviations/58626846#58626846). In the lambda expression in routine multiple_replace, you can replace mo.string.title with mo.string.lower for your purpose. Lookup would be the dictionary {"cat training":"p.27", ... Not sure about performance. – DarrylG Oct 30 '19 at 15:54
  • OK, that approach involves building one giant regex from all of the patterns and then using regex.sub() with the result. It's fairly elegant, but my dictionary contains many thousands of patterns, so I'm kind of afraid to try it. – Edward Falk Nov 06 '19 at 20:06

2 Answers2

1

Aho-Corasick algorithm was developed to tackle this type of problem.

It was used to answer a previous Stack Overflow question about matching a large number of patterns.

Python library for Aho–Corasick.

Procedure for modifying Aho-Corasick algorithm for word boundaries

DarrylG
  • 16,732
  • 2
  • 17
  • 23
0

In the interest of giving back to the community, here is my implementation of Aho-Corasick in Python. I release this to the public domain.

class AhoCorasick(object):
  """Aho-Corasick algorithm. Searches a string for any of
  a number of substrings.

  Usage: Create a list or other iterator of (needle, value) pairs.
      aho_tree = AhoCorasick(needlevaluelist)
      results = aho_tree.findAll(haystack)
      for result in results:
        # Each result is a tuple: (index, length, needle, value)

  values can be literally anything.

  Author: Edward Falk
  """
  def __init__(self, patternlist=None):
    self.root = None
    if patternlist:
      self.buildStateMachine(patternlist)
  def buildStateMachine(self, patternlist):
    root = self.__buildTree(patternlist)
    queue = []
    for node in root.goto.itervalues():
      queue.append(node)
      node.fail = root
    while queue:
      rnode = queue.pop(0)
      for key, unode in rnode.goto.iteritems():
        queue.append(unode)
        fnode = rnode.fail
        while fnode != None and key not in fnode.goto:
          fnode = fnode.fail
        unode.fail = fnode.goto[key] if fnode else root
        unode.output += unode.fail.output
    return root
  def findAll(self, string, start=0):
    '''Search this string for items in the dictionary. Return a list of
    (index, len, key, value) tuples.'''
    node = self.root
    for i,ch in enumerate(string[start:]):
      while node is not None and ch not in node.goto:
        node = node.fail
      if node is None:
        node = self.root
        continue
      node = node.goto[ch]
      for word,value in node.output:
        l = len(word)
        yield (i-l+1, l, word, value)
  def __buildTree(self, patternlist):
    """patternlist is a list (or any iterable) of (string,value) pairs."""
    root = AhoCorasick.Node()
    for word,value in patternlist:
      node = root
      for ch in word:
        if ch not in node.goto:
          node.goto[ch] = AhoCorasick.Node()
        node = node.goto[ch]
      node.output.append((word,value))
    self.root = root
    return root

  class Node(object):
    '''Aho-Corasick algorithm. Each node represents a state in the
    state machine.'''
    def __init__(self):
      self.goto = {}        # Map input to next state
      self.fail = None      # Map state to next state when character doesn't match
      self.output = []      # Map state to all index entries for that state
    def __repr__(self):
      return '<Node: %d goto, %d output>' % \
        (len(self.goto), len(self.output))
    def dump(self, name, indent):
      print "%s%s: AhoCorasickNode: %d goto, output %s, fail=%s" % \
        ("  "*indent, name, len(self.goto), self.output, self.fail)
      for k,v in self.goto.iteritems():
        v.dump(k, indent+1)
Edward Falk
  • 9,991
  • 11
  • 77
  • 112