Using Natural Language Tool Kit with Django on Heroku - Error: 'nltk.txt' not found

Question

I’ve got a basic Django project. One feature I am working on counts the number of most commonly occurring words in a .txt file, such as a large public domain book. I’ve used the Python Natural Language Tool Kit to filter out “stopwords” (in SEO language, that means redundant words such as ‘the’, ‘you’, etc. ).
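For readers following along, the counting feature described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `most_common_words` and the tiny `FALLBACK_STOPWORDS` set are assumptions made so the example runs without any NLTK data installed.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption for this sketch).
# With NLTK data installed, one could instead pass
# set(nltk.corpus.stopwords.words("english")).
FALLBACK_STOPWORDS = {"the", "you", "a", "and", "of", "to", "in", "is"}


def most_common_words(text, stopwords=FALLBACK_STOPWORDS, n=10):
    """Return the n most frequent non-stopword words in text."""
    # Lowercase the text and pull out runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Count only the words that are not in the stopword set.
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


sample = "The whale and the sea: you see the whale, the whale sees you."
print(most_common_words(sample, n=1))
```

Passing the stopword set in as a parameter keeps the counting logic independent of whether the NLTK corpus is available on the server.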

Anyways, I’m getting this debug traceback when Django serves the template:

Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('stopwords')

For more information see: https://www.nltk.org/data.html

So I need to download the library of stopwords. To resolve the issue, I simply open a Python REPL on my remote server and invoke these two straightforward lines:

>>> import nltk
>>> nltk.download('stopwords')

That's covered at length elsewhere on SO. That resolves the issue, but only temporarily. As soon as the REPL session is terminated on my remote server, the error returns because the stopwords file just evaporates.
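A likely explanation, offered as an assumption rather than something confirmed in this thread, is that files written during a one-off session on Heroku are not kept once the session ends. A small check like the one below can confirm whether the running app can actually see the corpus. The function name `stopwords_available` is hypothetical; `nltk.data.find` raises `LookupError` when a resource is missing.

```python
def stopwords_available():
    """Return True if NLTK can locate the stopwords corpus, else False."""
    try:
        import nltk

        # Raises LookupError if the resource is in none of NLTK's data paths.
        nltk.data.find("corpora/stopwords")
        return True
    except (ImportError, LookupError):
        # Either NLTK itself or the stopwords corpus is not installed.
        return False


print("stopwords available:", stopwords_available())
```

Running this from the deployed app rather than from a separate REPL session shows what the web process itself can find.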

I noticed something strange when I use git to push my changes up to my remote server on Heroku. Check this:

remote: -----> Python app detected
remote: -----> No change in requirements detected, installing from cache
remote: -----> Installing pip 20.1.1, setuptools 47.1.1 and wheel 0.34.2
remote: -----> Installing SQLite3
remote: -----> Installing requirements with pip
remote: -----> Downloading NLTK corpora…
remote:  !     'nltk.txt' not found, not downloading any corpora
remote:  !     Learn more: https://devcenter.heroku.com/articles/python-nltk 
remote: -----> $ python manage.py collectstatic --noinput
remote:        122 static files copied to '/tmp/build_f2f9d10f/staticfiles', 388 post-processed.

That devcenter link is something of a stub; it’s sparse at best. The article says that to use NLTK on Heroku, you need to add an nltk.txt file to the project directory which specifies the list of resources for Heroku to download. So I went ahead and created an nltk.txt file which contained:

corpora

Here is the active nltk.txt currently located in my project directory. In addition to corpora, I also tried adding various combinations of the following three entries to nltk.txt:

corpus

stoplist

english

I tried adding all four, just two and just one. For example, here is an alternate nltk.txt that I tried verbatim. My feeling is that the main one I really need is just corpora, so that is the only entry in the nltk.txt that I am working with right now. With corpora there, when I push the change and Heroku builds the environment, I see this error and trace-back:

remote: -----> Downloading NLTK corpora…
remote: -----> Downloading NLTK packages: corpora english stopwords corpus
remote: /app/.heroku/python/lib/python3.6/runpy.py:125: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
remote:   warn(RuntimeWarning(msg))
remote: [nltk_data] Error loading corpora: Package 'corpora' not found in
remote: [nltk_data]     index
remote: Error installing package. Retry? [n/y/e]
remote: Traceback (most recent call last):
remote:   File "/app/.heroku/python/lib/python3.6/runpy.py", line 193, in _run_module_as_main
remote:     "__main__", mod_spec)
remote:   File "/app/.heroku/python/lib/python3.6/runpy.py", line 85, in _run_code
remote:     exec(code, run_globals)
remote:   File "/app/.heroku/python/lib/python3.6/site-packages/nltk/downloader.py", line 2538, in <module>
remote:     halt_on_error=options.halt_on_error,
remote:   File "/app/.heroku/python/lib/python3.6/site-packages/nltk/downloader.py", line 790, in download

I am clearly not using nltk.txt properly because it isn’t finding the corpora package. I can install nltk and have it run without issue in my local dev server but my remaining question is this: how do I make Heroku handle nltk properly remotely in this situation?

User Michael Godshall provides the same answer to more than one Stack Overflow question, explaining that you can create a bin directory within the project root and add both a post_compile bash script and an install_nltk_data script. However, this is no longer necessary because heroku-buildpack-python upstream maintainer Kenneth Reitz implemented an easy solution. All that is required now is to add an nltk.txt which contains the library you need. But I did that and I am still getting the error above.

The official nltk website documents how to use the library in general and how to install it, which isn’t helpful in the case of Heroku because Heroku seems to handle nltk differently.

score 0 · Answer 1 · answered Nov 23 '20 at 06:03


Yes, you need an nltk.txt file, similar to the requirements.txt file. Refer to the official doc here. If you are still facing the same problem, post the nltk.txt file here; that will give us a way to find the solution.

Maybe this will also help you.

Darkknight

Hi @Darkknight! In my original question, I did specify the contents of my `nltk.txt` is just: `corpora` but based on your feedback, I edited my question moments ago with an elaboration of the different entries inside nltk.txt that I tried which didn't work. As for the Heroku doc link you shared, it is just a stub and refers to the more detailed Heroku devcenter link that I already referred to in my original question. Thanks for your feedback so far. – enoren5 Nov 24 '20 at 00:28

@Angeles89 https://github.com/heroku/heroku-buildpack-python/pull/460 check this – Darkknight Nov 25 '20 at 05:10

@Angeles89 also, the line endings of the `nltk.txt` can cause the same problem. Can you post the exact `nltk.txt` file without any modification, or check this https://help.github.com/articles/dealing-with-line-endings/ – Darkknight Nov 25 '20 at 05:12
Hi @Darkknight! Thanks for your further insight. I updated my question to include links to two sample nltk.txt files without any modification that I tried. For Heroku's Python buildpack issue on GitHub that you linked to, I am struggling to extrapolate what is required for my original question here. Based on the discussion in that issue, I'm not sure what to try in my case. My local native development environment is Manjaro Linux, so line endings in my text files cannot be the problem in this situation, although I appreciate the suggestion. Thanks again, Darkknight, for your help so far. – enoren5 Nov 26 '20 at 17:55

Instead of corpus in your nltk.txt, have you tried the exact corpus name? You can see the list here: http://www.nltk.org/nltk_data/ (The method you mentioned will work locally, but try the method I mentioned. Not sure whether it will work.) – Darkknight Nov 27 '20 at 05:01

score 0 · Accepted Answer · answered Nov 29 '20 at 14:24

Eureka! I got it working. My problem was with the name of the nltk library download. I tried stoplist when the actual name is stopwords. Ha! The content of my nltk.txt is now simply: stopwords. When I pushed to Heroku, the build succeeded and my website is now deployed and accessible on the web.
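To catch this kind of naming mistake before pushing, a rough pre-flight check along these lines might help. It is only a sketch: `check_nltk_txt` is a hypothetical helper, and `KNOWN_PACKAGE_IDS` is a small hand-picked subset of identifiers, not the full index listed at http://www.nltk.org/nltk_data/.

```python
# Small hand-picked subset of NLTK package identifiers (an assumption for
# this sketch; the authoritative list is the NLTK data index).
KNOWN_PACKAGE_IDS = {"stopwords", "punkt", "wordnet"}


def check_nltk_txt(contents, known_ids=KNOWN_PACKAGE_IDS):
    """Split nltk.txt contents into entries and flag unrecognised ones."""
    # One identifier per line; ignore blank lines and stray whitespace.
    entries = [line.strip() for line in contents.splitlines() if line.strip()]
    unknown = [entry for entry in entries if entry not in known_ids]
    return entries, unknown


# The entries from the failing build log shown in the question.
entries, unknown = check_nltk_txt("corpora\nenglish\nstopwords\ncorpus\n")
print("unrecognised entries:", unknown)
```

Names such as corpora, corpus and english describe categories or languages rather than downloadable packages, which is consistent with the "not found in index" error in the build log.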

Special thanks go out to @Darkknight for his patience and insight in the comment section of his answer.
