How to add a custom corpus to the local machine in NLTK

Question

I have a custom corpus that I built from data I need to classify. The dataset is in the same format as the movie_reviews corpus. According to the NLTK documentation, I use the following code to access the movie_reviews corpus. Is there any way to add a custom corpus to the nltk_data/corpora directory and access it the same way we access the existing corpora?

    import nltk
    from nltk.corpus import movie_reviews

    documents = [(list(movie_reviews.words(fileid)), category)
         for category in movie_reviews.categories()
         for fileid in movie_reviews.fileids(category)]

alexis · Accepted Answer · 2017-02-11T21:21:02.467

While you could hack the nltk to make your corpus look like a built-in nltk corpus, that's the wrong way to go about it. The nltk provides a rich collection of "corpus readers" that you can use to read your corpora from wherever you keep them, without moving them to the nltk_data directory or hacking the nltk source. The nltk's own corpora use the same corpus readers behind the scenes, so your reader will have all the methods and behavior of equivalent built-in corpora.

Let's see how the movie_reviews corpus is defined in nltk/corpus/__init__.py:

movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*',
    encoding='ascii')

You can ignore the LazyCorpusLoader part; it's for providing corpora that your program will most likely never use. The rest shows that the movie review corpus is read with a CategorizedPlaintextCorpusReader, that its files all end in .txt, and that the reviews are sorted into categories through being in the subdirectories pos and neg. Finally, the corpus encoding is ascii. So define your own corpus like this (changing values as needed):

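import nltk

# Point the reader at the corpus root; no need to copy files into nltk_data.
mycorpus = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"/home/user/path/to/my_corpus",
    r'(?!\.).*\.txt',              # file ids: any non-hidden .txt file
    cat_pattern=r'(neg|pos)/.*',   # category = name of the subdirectory
    encoding="ascii")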

That's it; you can now call mycorpus.words(), mycorpus.sents(categories="neg"), etc., just as if this was a corpus provided by the nltk.
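As a minimal sketch of the above, the following builds a throwaway two-file corpus in a temporary directory and reads it back in the same shape as the question's `documents` list. The sample texts, the file names, and the helpers `build_toy_corpus` and `load_documents` are made up for illustration; only `CategorizedPlaintextCorpusReader` and its `words()`, `categories()` and `fileids()` methods come from the nltk. It sticks to `words()` because `sents()` additionally needs the punkt tokenizer data to be downloaded.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def build_toy_corpus(root):
    """Write a tiny pos/neg corpus under root (hypothetical sample data)."""
    samples = {
        "pos/review1.txt": "A wonderful film.",
        "neg/review1.txt": "A dull film.",
    }
    for relpath, text in samples.items():
        path = os.path.join(root, relpath)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="ascii") as f:
            f.write(text)


def load_documents(root):
    """Return (word list, category) pairs, as in the question's code."""
    reader = CategorizedPlaintextCorpusReader(
        root,
        r'(?!\.).*\.txt',
        cat_pattern=r'(neg|pos)/.*',
        encoding="ascii")
    # list(...) forces the lazy corpus view to be read while the files exist.
    return [(list(reader.words(fileid)), category)
            for category in reader.categories()
            for fileid in reader.fileids(category)]


with tempfile.TemporaryDirectory() as root:
    build_toy_corpus(root)
    documents = load_documents(root)

print(documents)
```

For a real corpus you would skip `build_toy_corpus` and pass your own corpus directory to `load_documents`.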

score 1 · Answer 2 · answered Feb 11 '17 at 16:13

First put the actual data from your new corpus into your nltk_data/corpora/ directory. Then you have to edit the __init__.py file for nltk.corpus. You can find the path to this file by doing:

import nltk
print(nltk.corpus.__file__)

Open this file in a text editor and you will see that most of the file is creating LazyCorpusLoader objects and assigning them to global variables.

So for example, a section may look like:

....
verbnet = LazyCorpusLoader(
    'verbnet', VerbnetCorpusReader, r'(?!\.).*\.xml')
webtext = LazyCorpusLoader(
    'webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2')
wordnet = LazyCorpusLoader(
    'wordnet', WordNetCorpusReader,
    LazyCorpusLoader('omw', CorpusReader, r'.*/wn-data-.*\.tab', encoding='utf8'))  
....

In order to add a new corpus you just have to add a new line to this file in the same format as the examples above. So if you have a corpus named movie_reviews and you have the data saved in nltk_data/corpora/movie_reviews then you would want to add a line like:

movie_reviews = LazyCorpusLoader('movie_reviews', .... )

Additional arguments for LazyCorpusLoader are described in the NLTK documentation.

Then you just save this file and you should then be able to do:

from nltk.corpus import movie_reviews

And then, one day, you update NLTK and these changes will be wiped away without notice. It's safer to go with alexis' answer, really. — lenz, Feb 11 '17 at 23:27
@bunji - Tried it alexis' way. It's working. Thanks for your guidance. — Janitha, Feb 12 '17 at 05:29
@Janitha, glad to help. I guess I misinterpreted your request to access it "the same way we access existing corpora" as meaning it should be importable like existing corpora. my bad... — bunji, Feb 12 '17 at 15:24

score 1 · Answer 3 · edited Nov 18 '17 at 18:57

Ok, so I had a bit of a problem with the solution provided, and the easiest way that worked for me was to first create my folders and subfolders in the 'corpora' directory and then edit the __init__.py file.

In my case the corpus I created was vc, and the subfolders were audio_them, audio_us, video_them, video_us:

vc = LazyCorpusLoader(
    'vc', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt',
    cat_pattern=r'(audio_them|audio_us|video_them|video_us)/.*',
    encoding="ascii")
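To see how the two patterns sort files into categories, here is a rough illustration using only the standard `re` module. It is not the nltk's own code, and the helper `category_of` and the sample file names are hypothetical. It also assumes the category pattern ends in `/.*`, as in the movie_reviews definition.

```python
import re

# Same patterns as in the reader definition (the trailing "/.*" is assumed).
FILEID_PATTERN = r'(?!\.).*\.txt'
CAT_PATTERN = r'(audio_them|audio_us|video_them|video_us)/.*'


def category_of(fileid):
    """Return the category captured from a file id, or None if it does not fit."""
    # The file id must be a non-hidden .txt path relative to the corpus root.
    if not re.match(FILEID_PATTERN + r'$', fileid):
        return None
    # The first capture group of the category pattern names the category.
    match = re.match(CAT_PATTERN, fileid)
    return match.group(1) if match else None


for fileid in ["audio_us/call01.txt", "video_them/clip07.txt",
               ".hidden.txt", "notes/readme.txt"]:
    print(fileid, "->", category_of(fileid))
```

The point is that a file id is a path relative to the corpus root, so the subfolder name at its start is what becomes the category.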
