I am new to Python and come from a Java background.

I've got a project that uses nltk and nltk_data. I downloaded nltk_data with `nltk.download()` on my laptop and the project works fine, but I would like to automate the download of nltk_data.

I can download it from the command line, but I want to do it lazily, the way pip downloads a package on `pip install`. So my questions are:

  • Can I install nltk_data as a regular Python package with pip?
  • What is the best way to download nltk_data lazily?
Michael
  • It's not possible with pip because `nltk_data` is not a Python library but just a repository of files. Use `python -m nltk.downloader all`. – alvas May 30 '17 at 01:24
  • Thanks. It turned out I need just a subset of `nltk_data`. What is the best way to "pack" this subset as a project dependency so I can distribute my program? – Michael May 30 '17 at 08:18
  • Which dataset do you need? – alvas May 30 '17 at 08:47
  • `corpora/stopwords`, `stemmers`, and `tokenizers`. – Michael May 30 '17 at 09:05
  • I don't want to store this data in git because it's an external dependency rather than my project code. On the other hand, I don't want to download `nltk_data` each time someone builds the project. That's why I am wondering how to pack `nltk_data` as a Python package. – Michael May 30 '17 at 16:26

1 Answer

The bottom of the NLTK data documentation explains this:

Run the command `python -m nltk.downloader all`. To ensure central installation, run the command `sudo python -m nltk.downloader -d /usr/local/share/nltk_data all`.
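
For the "lazy" part of the question, one possible approach is to check for each resource at startup and download it only when it is missing. The sketch below is not taken from the NLTK documentation. It relies on `nltk.data.find` raising `LookupError` for a missing resource and on `nltk.download` accepting a package id. The helper name `ensure_nltk_resources`, the `find`/`download` parameters, and the `REQUIRED` mapping (including the `punkt` entry) are assumptions made for illustration, so adjust the mapping to the resources your project actually uses.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download each NLTK package only if its resource path is missing locally.

    `resources` maps a resource path (e.g. "corpora/stopwords") to the
    package id handed to the downloader (e.g. "stopwords").
    `find` and `download` default to nltk.data.find and nltk.download; they
    are parameters only so the logic can be exercised without network access.
    Returns the list of package ids that had to be downloaded.
    """
    if find is None or download is None:
        import nltk  # imported here so the helper can be defined without nltk
        find = find or nltk.data.find
        download = download or nltk.download

    downloaded = []
    for path, package_id in resources.items():
        try:
            find(path)  # raises LookupError when the resource is not installed
        except LookupError:
            download(package_id)
            downloaded.append(package_id)
    return downloaded


# Hypothetical subset of nltk_data; replace with what the project needs.
REQUIRED = {"corpora/stopwords": "stopwords", "tokenizers/punkt": "punkt"}
```

Calling `ensure_nltk_resources(REQUIRED)` once when the program starts would then fetch only the missing packages on the first run and do nothing on later runs.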

If you want to distribute your program, you might want to consider writing a setuptools setup.py file to simplify installation:

What is setup.py?

Official packaging docs

Azsgy