First, put the data for your new corpus into your nltk_data/corpora/ directory. Then edit the __init__.py file for nltk.corpus. You can find the path to this file by running:
import nltk
print(nltk.corpus.__file__)
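For the first step (placing the data), a minimal sketch using only the standard library could look like the following. The helper name install_corpus and the example paths are hypothetical, not part of NLTK; the sketch assumes your nltk_data directory is one that NLTK actually searches (the search locations are listed in nltk.data.path, and ~/nltk_data is one of the defaults).

```python
import shutil
from pathlib import Path


def install_corpus(source_dir, corpus_name, nltk_data_dir):
    """Copy source_dir into <nltk_data_dir>/corpora/<corpus_name>.

    Hypothetical helper for illustration; returns the target directory.
    """
    target = Path(nltk_data_dir) / "corpora" / corpus_name
    # Make sure the corpora/ directory exists before copying into it.
    target.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok (Python 3.8+) allows re-running the copy to refresh files.
    shutil.copytree(source_dir, target, dirs_exist_ok=True)
    return target


# Example call (placeholder paths, adjust to your setup):
# install_corpus("path/to/my_corpus", "my_corpus", Path.home() / "nltk_data")
```

The directory name under corpora/ is what the first argument of LazyCorpusLoader refers to, so the two need to match.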
Open this file in a text editor and you will see that most of it creates LazyCorpusLoader objects and assigns them to global variables. For example, a section may look like:
....
verbnet = LazyCorpusLoader(
    'verbnet', VerbnetCorpusReader, r'(?!\.).*\.xml')
webtext = LazyCorpusLoader(
    'webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2')
wordnet = LazyCorpusLoader(
    'wordnet', WordNetCorpusReader,
    LazyCorpusLoader('omw', CorpusReader, r'.*/wn-data-.*\.tab', encoding='utf8'))
....
To add a new corpus, add a new line to this file in the same format as the examples above. So if you have a corpus named movie_reviews and the data is saved in nltk_data/corpora/movie_reviews, you would add a line like:
movie_reviews = LazyCorpusLoader('movie_reviews', .... )
Additional arguments for LazyCorpusLoader are described in the NLTK documentation.
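If you want to try out loader arguments before touching the installed package, one option is a standalone script like the sketch below. It is only an illustration: the corpus name my_corpus, the file sample.txt, and the choice of PlaintextCorpusReader with a .txt pattern are assumptions for a plain-text corpus, and the temporary directory stands in for a real nltk_data directory.

```python
import tempfile
from pathlib import Path

import nltk
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader

# Build a throwaway nltk_data tree holding a hypothetical corpus "my_corpus".
data_root = Path(tempfile.mkdtemp())
corpus_dir = data_root / "corpora" / "my_corpus"
corpus_dir.mkdir(parents=True)
(corpus_dir / "sample.txt").write_text("A tiny sample document.", encoding="utf8")

# Make NLTK search the temporary tree first.
nltk.data.path.insert(0, str(data_root))

# Same shape as the lines in nltk/corpus/__init__.py:
# corpus directory name, reader class, file pattern, reader keyword arguments.
my_corpus = LazyCorpusLoader(
    "my_corpus", PlaintextCorpusReader, r"(?!\.).*\.txt", encoding="utf8"
)

# The corpus is only located and loaded on first attribute access.
print(my_corpus.fileids())
```

Once the arguments behave as expected here, the same LazyCorpusLoader(...) call can be used as the line added to __init__.py, where the reader classes are already imported.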
Save __init__.py, and you should then be able to run:
from nltk.corpus import movie_reviews