import nltk
from nltk.tokenize import word_tokenize

txt = "finding a common place isn't commonly available among commoners place"

fd = nltk.FreqDist()

for w in word_tokenize(txt.lower()):
    fd[w] += 1

I have the above script, which works fine. If I do `fd['place']` I get 2; if I type `fd['common']` I get 1.

Is it possible to type something similar to `fd['common*']` (which doesn't work) to obtain 3, and possibly a list of those matches? The three matches would be (common, commonly, commoners).

I'm assuming it has something to do with regex, but I'm not sure how to implement it with `FreqDist()`.

If not, are there any other packages that might do that?

Leb

2 Answers


FreqDist is just a kind of dictionary, and dictionary keys only work by exact match.

To use regexps for something like this, you need to do it the hard way: iterate over all the entries and add up the counts for the words that match. Of course, this needs to scan the whole list, so it will be slow if the list is large and you need to do it a lot.
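A minimal sketch of that linear scan, using `collections.Counter` (which `FreqDist` subclasses, so the same code works on an existing `fd`); the word list here is just illustrative:

```python
import re
from collections import Counter

# Stand-in for a FreqDist built from your tokens
fd = Counter(['common', 'commonly', 'commoners', 'place', 'place'])

pattern = re.compile(r'^common')  # anchor at the start for a prefix match

# One pass over every entry: collect matching words and sum their counts
matches = {w: c for w, c in fd.items() if pattern.search(w)}
total = sum(matches.values())
# matches -> {'common': 1, 'commonly': 1, 'commoners': 1}, total -> 3
```

This gives you both the aggregate count and the list of matching words, at the cost of touching every key on every query.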

If you're only after matching by prefixes, use a data structure called a "prefix tree" or "trie"; you can probably guess what it does. A simple work-around would be to record counts for each prefix of each word you see (so not just for the complete word).
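A sketch of that work-around; the prefix counts are kept in a separate `Counter` here so they don't collide with the exact-word counts in `fd` (keeping them in the same `FreqDist` would change what `fd['common']` returns):

```python
from collections import Counter

words = ['common', 'commonly', 'commoners', 'place', 'place']  # illustrative tokens

prefix_counts = Counter()
for word in words:
    # Count every prefix of the word, from length 1 up to the full word
    for i in range(1, len(word) + 1):
        prefix_counts[word[:i]] += 1

# prefix_counts['common'] -> 3  (common, commonly, commoners)
# prefix_counts['place']  -> 2
```

Lookups are now a single dictionary access instead of a full scan, at the cost of extra memory proportional to total word length.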

alexis

Utilizing Ch. 3.4, this is what I ended up doing:

import re

for w in fd:
    if re.search('common', w):  # matches 'common' anywhere in the word
        print(w, fd[w])
Leb
  • As I wrote in my answer, this is extremely slow since you have to scan your entire vocabulary for every word you check. – alexis Oct 01 '15 at 15:58
  • I agree with you. `FreqDist` is slow overall I might ditch it for `Counter` which seems to run significantly faster. The problem lies not only at prefixes, my example might have showed it that way but I'm looking for all matches that contain that word. – Leb Oct 01 '15 at 16:15
  • But you're just enumerating and scanning all the keys. This is far slower than any possible difference between `FreqDist` and `Counter`. If you want fast lookups, you need to index the substrings or you need to change your approach. – alexis Oct 01 '15 at 18:37
  • Sorry for the confusion, the lookup itself isn't slow, I was referring to the actual counting process being fast between `FreqDist` and `Counter`. – Leb Oct 01 '15 at 19:03