Finding path for corpus in NLTK

Question

I am using the Natural Language Toolkit (NLTK) for Python to write a program. In it, I am trying to load a corpus of my own files. To do that, I am using code to the following effect:

from nltk.corpus import PlaintextCorpusReader
corpus_root = "(insert filepath here)"
wordlists = PlaintextCorpusReader(corpus_root, '.*')

Let's say my file is called reader.py and my corpus files are in a directory called 'corpus', located in the same directory as reader.py. I would like a general way to find the filepath above, so that the code can locate the 'corpus' directory wherever it is placed, for anyone using the code. I have tried the approaches in this post, but they only give me absolute file paths: Find current directory and file's directory

Any help would be greatly appreciated!

score 2 · Answer 1 · edited Jun 01 '18 at 17:45


On my machine (Anaconda, inside a conda environment), the NLTK corpora are located at:

C:\Users\UserName\AppData\Roaming\nltk_data\corpora

answered Jun 01 '18 at 13:28 by Kiran Maharjan (edited by LuFFy)

Does this really answer the question? – Vega Jun 01 '18 at 14:00

score 1 · Accepted Answer · edited May 23 '17 at 12:05

From what I understand:

- Your reader.py file and corpus directory are always in the same directory.
- You're looking for a way to refer to corpus from reader.py regardless of where you put them in your directory structure.

In that case, the question that you referred to seems to be what you need. Another way of doing it is in this other answer. Using that second option, your code would then be:

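from nltk.corpus import PlaintextCorpusReader
import os.path

# Directory that contains this script (reader.py)
basepath = os.path.dirname(__file__)
# Absolute path to the 'corpus' directory next to the script
corpus_root = os.path.abspath(os.path.join(basepath, "corpus"))
# Load every file in that directory
wordlists = PlaintextCorpusReader(corpus_root, '.*')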

Keep in mind that although an absolute path is created, it is built from the result of basepath = os.path.dirname(__file__) above, which yields the directory containing reader.py. The path therefore follows the script wherever it is moved, as long as the corpus directory moves with it. See the official os.path documentation for details.
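The same idea can also be written with pathlib. The following is only a minimal sketch: the helper name find_corpus_root and its parameters are hypothetical, chosen for illustration, and are not part of NLTK. It assumes the corpus directory sits next to the script, as described in the question.

```python
from pathlib import Path


def find_corpus_root(script_file, dirname="corpus"):
    """Return the absolute path of `dirname` located next to `script_file`.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # resolve() makes the path absolute, so the result does not depend on
    # the working directory the script is started from.
    script_dir = Path(script_file).resolve().parent
    corpus_root = script_dir / dirname
    # Fail early with a clear message if the directory is missing.
    if not corpus_root.is_dir():
        raise FileNotFoundError(f"No corpus directory at {corpus_root}")
    return str(corpus_root)
```

Inside reader.py, this would be called as corpus_root = find_corpus_root(__file__), and the result passed to PlaintextCorpusReader(corpus_root, '.*') exactly as in the code above. The explicit directory check is a design choice of this sketch, meant to surface a misplaced corpus before the reader is created.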
