Please, please, please help. I have a folder filled with text files that I want to use NLTK to analyze. How do I import that as a corpus and then run NLTK commands on it? I've put together the code below but it's giving me this error:

    raise error, v # invalid expression
sre_constants.error: nothing to repeat

Code:

import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpus_root = '/Users/jt/Documents/Python/CRspeeches'
speeches = PlaintextCorpusReader(corpus_root, '*.txt')

print "Finished importing corpus" 

words = FreqDist()

for sentence in speeches.sents():
    for word in sentence:
        words.inc(word.lower())

print words["he"]
print words.freq("he")
Jolijt Tamanaha
  • You're not giving us much to go on. In short, **where** do you have an error? Please include the full error trace for starters, then go over your program step by step. Does your corpus consist of `.txt` files in the directory `CRspeeches`? After initializing `speeches`, do you get a list of your files with `print(speeches.fileids())`? Can you _print_ some of the sentences that should be returned by `speeches.sents()`? – alexis Sep 28 '14 at 22:03

1 Answer

I believe this problem comes from a known bug (or maybe it's a feature?), which is partially explained in this answer. In short, a regex that starts with a repetition operator such as `*` has nothing to repeat, so it fails to compile.

The source of the error is your `speeches = ...` line: the pattern `'*.txt'` is not a valid regular expression. Change the line to the following:

speeches = PlaintextCorpusReader(corpus_root, r'.*\.txt')

Then everything will load and compile just fine.
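
You can reproduce the error without NLTK at all. Here is a minimal sketch, assuming Python 3 and only the standard `re` module. The helper `compiles` and the file name `speech01.txt` are made up for illustration and are not part of NLTK.

```python
import re

def compiles(pattern):
    """Return True if `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # '*.txt' ends up here: the leading '*' has nothing to repeat.
        return False

glob_style = '*.txt'       # what the question used
regex_style = r'.*\.txt'   # the corrected pattern

print(compiles(glob_style))    # False
print(compiles(regex_style))   # True

# The corrected pattern matches a hypothetical file name.
print(re.match(regex_style, 'speech01.txt') is not None)   # True
```

The first pattern fails because `*` means "repeat the previous item" and there is no previous item. In the second, `.*` repeats "any character" and `\.` matches a literal dot.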

davidlowryduda
  • Do I have to keep loading the corpus whenever I use it or can I now just write import speeches at the top of my nltk scripts? – Jolijt Tamanaha Sep 28 '14 at 23:20
  • Well spotted, @mixedmath! But it's not a bug: A regexp that starts with `*` is malformed. (The error message could have been more informative, though.) – alexis Oct 01 '14 at 13:32
  • Let's clarify: `*.txt`, which the OP tried, is a _glob_ that matches all files with the extension `.txt`. But the NLTK's corpus readers don't accept globs, they accept full regular expressions. @mixedmath's solution translates @Jolijt's glob to the equivalent regexp, `.*\.txt`. – alexis Oct 03 '14 at 12:33
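
The glob-versus-regexp distinction in the last comment can be checked with the standard library alone. This is a minimal sketch, assuming Python 3: `fnmatch.translate` turns a glob into a regular-expression string. The file names are made up for illustration. I have not verified that NLTK's corpus readers accept the translated string unchanged, so treat this as a way to see an equivalent pattern, not as a drop-in replacement for the hand-written `.*\.txt`.

```python
import fnmatch
import re

glob_pattern = '*.txt'

# Translate the glob into a regular-expression string.
regex_pattern = fnmatch.translate(glob_pattern)
matcher = re.compile(regex_pattern)

# Hypothetical file names, for illustration only.
for name in ['speech01.txt', 'notes.md']:
    print(name, bool(matcher.match(name)))
# speech01.txt True
# notes.md False
```

The exact string returned by `fnmatch.translate` differs between Python versions, so print `regex_pattern` yourself if you want to inspect it.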