3

I have a problem and I have no idea how to solve it. Please give me some advice.

I have a text. A big, big text. The task is to find all the repeated phrases of length 3 (i.e. phrases consisting of three words) in the text.

Bob
  • 10,427
  • 24
  • 63
  • 71
  • Specifically, it's pretty hard to come across texts so big that the obvious algorithms don't work (list all the three-word phrases and count them). – Katriel Dec 24 '10 at 14:27

4 Answers

7

You have, it seems to me, two problems.

The first is coming up with an efficient way of normalizing the input. You say you want to find all of the three-word phrases in the input, but what constitutes a phrase? For instance, are "the black dog" and "The black, dog?" the same phrase?

A way of doing this, as marcog suggests, is by using something like re.findall. But this is pretty inefficient: it traverses your entire input and copies the words into a list, and then you have to process that list. If your input text is very long, that's going to be wasteful of both time and space.

A better approach would be to treat the input as a stream, and build a generator that pulls off one word at a time. Here's an example, which uses spaces as the delimiter between words, then strips non-alpha characters out of the words and converts them to lower case:

>>> import re
>>> def words(text):
        pattern = re.compile(r"[^\s]+")                   # runs of non-whitespace
        non_alpha = re.compile(r"[^a-z]", re.IGNORECASE)  # anything that isn't a letter
        for match in pattern.finditer(text):
            nxt = non_alpha.sub("", match.group()).lower()
            if nxt:  # skip blank, non-alpha words
                yield nxt


>>> text
"O'er the bright blue sea, for Sir Joseph Porter K.C.B."
>>> list(words(text))
['oer', 'the', 'bright', 'blue', 'sea', 'for', 'sir', 'joseph', 'porter', 'kcb']

The second problem is grouping the normalized words into three-word phrases. Again, here is a place where a generator will perform efficiently:

>>> def phrases(words):
        phrase = []
        for word in words:
            phrase.append(word)
            if len(phrase) > 3:
                phrase.pop(0)          # drop the oldest word to keep a 3-word window
            if len(phrase) == 3:
                yield tuple(phrase)

>>> list(phrases(words(text)))
[('oer', 'the', 'bright'), ('the', 'bright', 'blue'), ('bright', 'blue', 'sea'), ('blue', 'sea', 'for'), ('sea', 'for', 'sir'), ('for', 'sir', 'joseph'), ('sir', 'joseph', 'porter'), ('joseph', 'porter', 'kcb')]

There's almost certainly a simpler version of that function possible, but this one's efficient, and it's not hard to understand.

Significantly, chaining the generators together traverses the input only once, and it doesn't build any large temporary data structures in memory. You can use the result to build a defaultdict keyed by phrase:

>>> import collections
>>> counts = collections.defaultdict(int)
>>> for phrase in phrases(words(text)):
        counts[phrase] += 1

This makes a single pass over the text as it counts the phrases. When it's done, find every entry in the dictionary whose value is greater than one.
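For example, that final filter could be as simple as a comprehension over the counts dict built above (just a sketch; for the short sample text here the result would be empty, since no phrase repeats):

>>> [phrase for phrase, count in counts.items() if count > 1]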

Skylar Saveland
  • 11,116
  • 9
  • 75
  • 91
Robert Rossney
  • 94,622
  • 24
  • 146
  • 218
  • +1. Using iterators is the only option when it comes to processing really big data. The only suggestion from my side is to use `collections.Counter` (and its `most_common` method) instead of `collections.defaultdict`. But AFAIK this is only available in Python 2.7+ http://docs.python.org/library/collections.html#collections.Counter – Tomasz Elendt Dec 24 '10 at 22:08
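A minimal sketch of that Counter alternative, assuming Python 2.7+ and the words and phrases generators defined above:

import collections

counts = collections.Counter(phrases(words(text)))   # counts each 3-word phrase
repeated = [(phrase, n) for phrase, n in counts.most_common() if n > 1]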
2

The crudest way would be to read the text into a string, do a string.split() to get the individual words in a list, then take three-word slices of the list and use collections.defaultdict(int) to keep the count.

import collections

words = text.split()                 # individual words in a list
d = collections.defaultdict(int)
for i in range(len(words) - 2):      # each three-word slice of the list
    d[tuple(words[i:i+3])] += 1

As I said, it's very crude, but it should certainly get you started.

easysid
  • 504
  • 2
  • 6
  • 13
  • I started to do it that way before posting this question, but it's crude. The problem is that the text contains a great number of symbols like ! ? : " . ( ) etc. Should I write string.split() for every symbol? – Bob Dec 24 '10 at 14:54
  • @user `re.findall(r"[\w']+", "Hello, world!")` would be a better starting point. – moinudin Dec 24 '10 at 15:01
  • @user Yes, punctuation quickly gets complicated. If you split on a full stop, then abbreviations will be split into multiple letters. What about quotes (which might be apostrophes)? So I think you have to make some assumptions. To do it properly would require a classification system (as is possible with NLTK), but this would be relatively slow and might not give a significant enough improvement. – winwaed Dec 24 '10 at 16:52
  • See the following link for removing punctuation and symbols: http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python. However, it has its own drawbacks, as mentioned in the previous comment. You should then probably look at NLTK, as @winwaed suggests. – easysid Dec 25 '10 at 17:10
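For illustration, a small sketch of the re.findall tokenization suggested in the comments above (the sample string is made up):

import re

text = 'He said: "The black dog! The black dog?"'
words = [w.lower() for w in re.findall(r"[\w']+", text)]   # splits on punctuation and whitespace
# ['he', 'said', 'the', 'black', 'dog', 'the', 'black', 'dog']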
1

I would suggest looking at the NLTK toolkit. It is open source and intended for teaching natural language processing. As well as higher-level NLP functions, it has a lot of tokenizing functions and text collections.
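A minimal sketch of what that might look like, assuming NLTK is installed and its tokenizer data has been downloaded (the sample text is made up):

import nltk
from nltk.util import ngrams

text = "The black dog saw the black dog."
words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]  # drop punctuation tokens
counts = nltk.FreqDist(ngrams(words, 3))                              # count each 3-word phrase
repeated = [phrase for phrase, count in counts.items() if count > 1]
# [('the', 'black', 'dog')]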

winwaed
  • 7,645
  • 6
  • 36
  • 81
0

Here's a roughly O(n) solution, which should work on pretty large input texts. If it's too slow, you probably want to look into using Perl, which was designed for text processing, or C++ for pure performance.

>>> import collections
>>> s = 'The quick brown fox jumps over the lazy dog'
>>> words = s.lower().split()
>>> phrases = collections.defaultdict(int)
>>> for a, b, c in zip(words, words[1:], words[2:]):
...     phrases[(a, b, c)] += 1
... 
>>> phrases
defaultdict(<type 'int'>, {('over', 'the', 'lazy'): 1, ('quick', 'brown', 'fox'): 1, ('the', 'quick', 'brown'): 1, ('jumps', 'over', 'the'): 1, ('the', 'lazy', 'dog'): 1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1})
>>> [phrase for phrase, count in phrases.iteritems() if count > 1]
[]
moinudin
  • 134,091
  • 45
  • 190
  • 216