How to install nltk_data as package with pip?

Question

I am new to Python and come from a Java background.

I have a project that uses nltk and nltk_data. I downloaded nltk_data with nltk.download() on my laptop and the project works fine, but I would like to automate the download of nltk_data.

I can download it from the command line, but I want it to happen lazily, the way pip downloads a package on pip install. So my questions are:

Can I install nltk_data as a regular Python package with pip?
What is the best way to download nltk_data lazily?

It's not possible with pip because `nltk_data` is not a python library but just a repository of files. Use `python -m nltk.downloader all`. — alvas, May 30 '17 at 01:24
Thanks. It turned out I need just a subset of `nltk_data`. What is the best way to "pack" this subset as a project dependency so I can distribute my program? — Michael, May 30 '17 at 08:18
I don't want to store these data in git because it's an external dependency rather than my project code. On the other hand I don't want to download `nltk_data` each time one builds the project. That's why I am wondering how to pack `nltk_data` as a python package. — Michael, May 30 '17 at 16:26

score 4 · Accepted Answer · answered May 29 '17 at 16:27

The bottom of the NLTK data documentation explains this:

Run the command `python -m nltk.downloader all`. To ensure central installation, run the command `sudo python -m nltk.downloader -d /usr/local/share/nltk_data all`.
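If you only need a subset and want it fetched lazily, a minimal sketch along these lines may help. It relies on `nltk.data.find` raising `LookupError` for a missing resource and on `nltk.download` accepting a package id. The function name `ensure_nltk_data`, the `REQUIRED` mapping, and the choice of `punkt` and `stopwords` are my own illustrative assumptions, not something from the NLTK docs or your project.

```python
from typing import Callable, Dict, List, Optional

# Illustrative subset (an assumption): package id -> resource path inside nltk_data.
REQUIRED = {"punkt": "tokenizers/punkt", "stopwords": "corpora/stopwords"}


def ensure_nltk_data(
    resources: Dict[str, str],
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[[str], object]] = None,
) -> List[str]:
    """Download only the resources that are missing; return the ids fetched.

    `find` and `download` default to nltk.data.find and nltk.download; they
    are parameters only so the logic can be exercised without NLTK installed.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the defaults are needed

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for package_id, resource_path in resources.items():
        try:
            find(resource_path)  # raises LookupError if not on the search path
        except LookupError:
            download(package_id)
            fetched.append(package_id)
    return fetched
```

Calling `ensure_nltk_data(REQUIRED)` once at program start means the first run downloads whatever is missing and later runs skip the download because the lookup succeeds.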

If you want to distribute your program, you might want to consider writing a setuptools setup.py file to simplify installation:

What is setup.py?

Official packaging docs
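One way to combine this with a setup.py, sketched here under assumptions rather than taken from the NLTK docs: copy the subset you need into an `nltk_data` directory inside your package at build time, declare it as package data in setup.py, and add that directory to NLTK's search path (the list `nltk.data.path`) when your package is imported. The package name `myproject` and the helper `register_bundled_data` are hypothetical.

```python
from pathlib import Path
from typing import List


def register_bundled_data(search_path: List[str], package_dir: Path) -> str:
    """Put <package_dir>/nltk_data at the front of an NLTK-style search path.

    `search_path` is expected to be a list like nltk.data.path; it is
    modified in place. Returns the directory that was registered.
    """
    data_dir = str(package_dir / "nltk_data")
    if data_dir not in search_path:
        search_path.insert(0, data_dir)
    return data_dir


# Intended use inside the hypothetical myproject/__init__.py:
#
#     import nltk
#     register_bundled_data(nltk.data.path, Path(__file__).parent)
```

With this layout the data travels with the built distribution, so nothing is downloaded at install time or at run time. Whether it is acceptable to redistribute a given corpus this way depends on that corpus's license.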
