2

How do I test whether a phrase is in a large (650k) list of phrases when that list includes special categories?

For instance, I want to test if the phrase ["he", "had", "the", "nerve"] is in the list. It is, but under ["he", "had", "!DETERMINER", "nerve"] where "!DETERMINER" is the name of a wordclass that contains several choices (a, an, the). I have about 350 wordclasses and some of them are quite lengthy, so I don't think it would be feasible to enumerate each item in the list that has one (or more) wordclasses.

I would like to use a set of these phrases instead of slowly working my way through a list, but I don't know how to deal with the variability of the wordclasses. Speed is pretty important, since I need to make this comparison hundreds of thousands of times per go.

KillianDS
  • 16,936
  • 4
  • 61
  • 70

3 Answers3

1

Similar to pjwerneck's suggestion, you could use a tree (or more specifically a trie) to store the lists in parts, but extend it to treat the categories specially.

# phrase_trie.py

from collections import defaultdict

CATEGORIES = {"!DETERMINER": set(["a","an","the"]),
              "!VERB": set(["walked","talked","had"])}

def get_category(word):
    for name,words in CATEGORIES.items():
        if word in words:
            return name
    return None

class PhraseTrie(object):
    def __init__(self):
        self.children = defaultdict(PhraseTrie)
        self.categories = defaultdict(PhraseTrie)

    def insert(self, phrase):
        if not phrase: # nothing to insert
            return

        this=phrase[0]
        rest=phrase[1:]

        if this in CATEGORIES: # it's a category name
            self.categories[this].insert(rest)
        else:
            self.children[this].insert(rest)

    def contains(self, phrase):
        if not phrase:
            return True # the empty phrase is in everything

        this=phrase[0]
        rest=phrase[1:]

        test = False

        # the `if not test` are because if the phrase satisfies one of the
        # previous tests we don't need to bother searching more

        # allow search for ["!DETERMINER", "cat"]
        if this in self.categories: 
            test = self.categories[this].contains(rest)

        # the word is literally contained
        if not test and this in self.children:
            test = self.children[this].contains(rest)

        if not test:
            # check for the word being in a category class like "a" in
            # "!DETERMINER"
            cat = get_category(this)
            if cat in self.categories:
                test = self.categories[cat].contains(rest)
        return test

    def __str__(self):
        return '(%s,%s)' % (dict(self.children), dict(self.categories))
    def __repr__(self):
        return str(self)

if __name__ == '__main__':
    words = PhraseTrie()
    words.insert(["he", "had", "!DETERMINER", "nerve"])
    words.insert(["he", "had", "the", "evren"])
    words.insert(["she", "!VERB", "the", "nerve"])
    words.insert(["no","categories","here"])

    for phrase in ("he had the nerve",
                   "he had the evren",
                   "she had the nerve",
                   "no categories here",
                   "he didn't have the nerve",
                   "she had the nerve more"):
        print '%25s =>' % phrase, words.contains(phrase.split())

Running python phrase_trie.py:

         he had the nerve => True
         he had the evren => True
        she had the nerve => True
       no categories here => True
 he didn't have the nerve => False
   she had the nerve more => False

Some points about the code:

  • The use of defaultdict is to avoid having to check if that sub-trie exists before calling insert; it is automatically created and initialised when needed.
  • If there are going to be a lot of calls to get_category, it might be worth constructing a reverse look-up dictionary for speed. (Or, even better, memoise the calls to get_category so that common words have fast look-ups but you don't waste the memory storing words you never look up.)
  • The code assumes that each word is in only one category. (If not, the only changes are get_category returning a list and the relevant section of PhraseTrie looping through this list.)
huon
  • 94,605
  • 21
  • 231
  • 225
0

As a start, make two dicts:

partOfSpeech = {'a':'!DETERMINER', 'an':'!DETERMINER', 'the':'!DETERMINER'}
words = {'!DETERMINER': set(['a', 'an', 'the'])}

This should - at the very least - get you some speed up. Let's see how much of a speedup this gets you and if it's not enough, leave a comment, and I'll try to do better (or others from the SO community might even be able to improve my solution or offer better ones altogether).

inspectorG4dget
  • 110,290
  • 27
  • 149
  • 241
  • I think my main problem with the speed isn't whether the wordclasses are in a list or a dictionary but whether the approved phrases are in a list or a set. Right now when it sees !DETERMINER it can test pretty quickly whether "the" is in it or not. The problem is that the 650k approved phrases are all in a list (of lists of tokens) and it's really slow to try to find if one of them (potentially with wordclasses) matches all the candidate tokens. Does that make sense? – Will Penman Apr 21 '12 at 03:13
  • For your first point, `set`s and `dict`s are implementations based on hashing. Therefore, they offer an O(1) lookup time, versus lists, in which lookup is O(n). But I see your point. Let me think about this for a bit, before I hastily post something potentially useless – inspectorG4dget Apr 21 '12 at 03:28
  • Your main problem is that your large list (the one with all the phrases) is not sorted in any way. If you could put it in a more "useful" data structure as part of a "pre-processing step", then you'd have better luck. Right now, all you can do is go through the entire list -> slow. If you could turn that list into a tree (or better yet, a [trie](http://en.wikipedia.org/wiki/Trie)), then you'd get much more impressive speedups – inspectorG4dget Apr 21 '12 at 03:40
0

If speed is important and you have to deal with wordclasses, instead of storing a list of phrases and wordclasses you should consider storing it as a tree of words, such that the child's depth is it's position in the phrase. That way you can simply query each level and the search is narrowed as you move up. Once you can't find a word, you know the phrase isn't listed.

This is a very naive implementation, just as an example for you. You may even implement it as nested dicts like this, but if your data is large and dynamic, you should probably be using a database:

tree = {'he':
        {'had':
         {'the': {'nerve': {}, 'power': {}, 'gift': {}},
          'a': {'car': {}, 'bike': {}},
          'an': {'airplane': {}, 'awful': {'smell': {}}}
          }
         }
        }


def find_phrase(phrase, root):    
    if not phrase:
        return True

    try:
        next = root[phrase[0]]
        return find_phrase(phrase[1:], next)
    except KeyError:
        return False

    return False


assert find_phrase(['he', 'had', 'the', 'nerve'], tree) is True
assert find_phrase(['he', 'had', 'the', 'power'], tree) is True
assert find_phrase(['he', 'had', 'the', 'car'], tree) is False
assert find_phrase(['he', 'had', 'an', 'awful', 'smell'], tree) is True
assert find_phrase(['he', 'had', 'an', 'awful', 'wife'], tree) is False
Pedro Werneck
  • 40,902
  • 7
  • 64
  • 85
  • So essentially enumerate all the possible options? If I did that at least then I could make it all a set and it would be easy to test for membership. I'm just afraid it's going to blow up the number of phrases in the list... – Will Penman Apr 21 '12 at 03:35
  • It's not enumerating all possible options as a set and test for membership. You model your phrase data as a tree instead of a list, so you first search among the root nodes if the first word is there. If it is, then you search only its children for the second word, and so forth. If you find all words, you have a match. If you can't find a word at any level, you can quit the search. – Pedro Werneck Apr 21 '12 at 03:38
  • Yeah, but the root node could be a wordclass, and if there's 20 options in the wordclass and 15 approved phrases that start with that wordclass then you would need to model 20*15 phrases in the tree, right? That seems really similar to enumerating the possible options (but I don't know much about trees, maybe I'm not understanding something) – Will Penman Apr 21 '12 at 04:13
  • Enumerating the cases would be constructing every possible phrase. That's not what I'm suggesting with a tree. With a tree you create relationships between words, so you know which words can be the next, and therefore stop the search as soon as you can't find one word. I added a sample implementation in my answer so maybe the idea becomes clearer for you. In an implementation like that, finding if a phrase exists is only n O(1) lookups ahead, where n is the number of words in that phrase. Can't get much better than that. – Pedro Werneck Apr 21 '12 at 04:18