You can keep your corpus files in a local directory and just add a symlink from an nltk_data/corpora folder to the location of your corpus, as the paragraph you quoted suggests. But if you can't modify nltk_data, or just don't like the idea of a needless round trip through the nltk_data directory, read on.
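If you do go the symlink route, a minimal sketch might look like the following. The helper name link_corpus and the throwaway demo directories are made up for illustration; in real use you would pass the actual location of your corpus and of an nltk_data directory that NLTK searches (one listed in nltk.data.path).

```python
import os
import tempfile


def link_corpus(corpus_dir, nltk_data_dir, name="ptb"):
    """Symlink corpus_dir as <nltk_data_dir>/corpora/<name>; return the link path.

    Hypothetical helper, not part of NLTK.
    """
    corpora_dir = os.path.join(nltk_data_dir, "corpora")
    os.makedirs(corpora_dir, exist_ok=True)
    link_path = os.path.join(corpora_dir, name)
    # Leave an existing entry alone instead of overwriting it.
    if not os.path.lexists(link_path):
        os.symlink(os.path.abspath(corpus_dir), link_path, target_is_directory=True)
    return link_path


# Demo on throwaway directories so that nothing real is touched.
demo_corpus = tempfile.mkdtemp(prefix="my_ptb_")
demo_nltk_data = tempfile.mkdtemp(prefix="nltk_data_")
demo_link = link_corpus(demo_corpus, demo_nltk_data)
print(demo_link, "->", os.path.realpath(demo_link))
```

On Windows, creating symlinks may need extra privileges, which is one more reason to prefer the direct approach described next.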
The object ptb is just a shortcut to a corpus reader object initialized with the appropriate settings for the Penn Treebank corpus. It is defined (in nltk/corpus/__init__.py) like this:
ptb = LazyCorpusLoader( # Penn Treebank v3: WSJ and Brown portions
    'ptb', CategorizedBracketParseCorpusReader,
    r'(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG',
    cat_file='allcats.txt', tagset='wsj')
You can ignore the LazyCorpusLoader part; it's used because NLTK defines a lot of corpus endpoints, most of which are never loaded in any one Python program. Instead, create a corpus reader by instantiating CategorizedBracketParseCorpusReader directly. If your corpus looks exactly like the ptb corpus, you'd call it like this:
from nltk.corpus.reader import CategorizedBracketParseCorpusReader
myreader = CategorizedBracketParseCorpusReader(
    r"<path to your corpus>",
    r'(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG',
    cat_file='allcats.txt', tagset='wsj')
As you can see, you supply the path to the real location of your files and leave the remaining arguments the same: a regexp of file names to include in the corpus (matched against paths relative to the corpus root), a file mapping corpus files to categories, and the tagset to use. The object you create will be the same kind of corpus reader as ptb or treebank (except that it is not lazily created).
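To try this out without the real treebank, here is a rough sketch that builds a one-sentence toy corpus in a temporary directory, laid out the way the regexp above expects, and reads it back. The file name WSJ_0001.MRG, the category name news, and the sentence itself are invented for the demo; the only thing taken from the ptb definition is the set of constructor arguments.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedBracketParseCorpusReader

# Build a throwaway corpus root with one file in the WSJ/<section>/ layout.
root = tempfile.mkdtemp(prefix="toy_ptb_")
os.makedirs(os.path.join(root, "WSJ", "00"))
with open(os.path.join(root, "WSJ", "00", "WSJ_0001.MRG"), "w", encoding="utf8") as f:
    f.write("( (S (NP-SBJ (NNP John)) (VP (VBZ sleeps)) (. .)) )\n")

# The category file maps each file id to one or more space-separated categories.
with open(os.path.join(root, "allcats.txt"), "w", encoding="utf8") as f:
    f.write("WSJ/00/WSJ_0001.MRG news\n")

# Same arguments as the ptb definition; only the root path differs.
toy_reader = CategorizedBracketParseCorpusReader(
    root,
    r'(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG',
    cat_file='allcats.txt', tagset='wsj')

print(toy_reader.fileids())
print(toy_reader.categories())
print(list(toy_reader.words()))
print(toy_reader.parsed_sents()[0])
```

If fileids() comes back empty on your own data, the usual suspect is the regexp: it has to match each file's path relative to the root, including case and extension.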