16

I use NLTK with WordNet in my project. I installed it manually on my PC: first pip3 install nltk --user in a terminal, then nltk.download() in a Python shell to download WordNet.

I want to automate this with a setup.py file, but I don't know a good way to install WordNet.

For the moment, I have this piece of code after the call to setup ("nltk" is in the install_requires list of the call to setup):

import sys
if 'install' in sys.argv:
    import nltk
    nltk.download("wordnet")

Is there a better way to do this?

Arne
Tom Cornebize
  • @martin-thoma from a quick glance, looks like the _nltk data_ dependencies could be packaged as Python projects and distributed on PyPI without too much work. The whole thing could be relatively easily scripted and delegated to a CI/CD system. You should weigh in on these tickets: https://github.com/nltk/nltk_data/issues/12 https://github.com/nltk/nltk/issues/2228 – sinoroc Oct 12 '19 at 15:11
  • @martin-thoma also, here is a rather similar post I wrote about the same problem with spacy: https://stackoverflow.com/questions/57773454/package-spacy-model/57782864#57782864 does that apply to your situation as well? – Arne Oct 14 '19 at 07:13
  • For my use case, the best option seemed to be to list all dependencies in a `requirements.txt` file and use `pip install -r requirements.txt` first. Then in my `setup.py` I have the manual download command `nltk.download("punkt")` which is used when I run `pip install -e .` I believe this works because I'm building a Docker image/container, not trying to distribute a package. – rkechols Jan 28 '22 at 23:46

3 Answers

14

I managed to install the NLTK data in setup.py by overriding cmdclass with my own Install class:

from setuptools import setup, find_packages
from setuptools.command.install import install as _install


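class Install(_install):
    def run(self):
        # Run the egg-style install first so that install_requires (nltk) gets installed.
        _install.do_egg_install(self)
        # Import only now: nltk is not guaranteed to be importable any earlier.
        import nltk
        nltk.download("popular")


setup(
    # ... other setup() arguments ...
    cmdclass={'install': Install},
    install_requires=[
        'nltk',
    ],
    setup_requires=['nltk'],
    # ... other setup() arguments ...
)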

It is important to call do_egg_install() in your run() method so that nltk is installed before import nltk runs (see also the question "python setuptools install_requires is ignored when overriding cmdclass"). Also, don't forget to add nltk to setup_requires.

alvas
asmaier
3

You can also automate installation with a shell script, for example, running (after pip installing nltk):

python -m nltk.downloader -d /usr/share/nltk_data wordnet
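If the rest of your automation is already in Python, the same command can be launched from a script. The following is only a sketch: the helper names nltk_downloader_command and download_nltk_packages are made up for illustration, and it assumes nltk is already installed for the interpreter that runs the script.

```python
import subprocess
import sys


def nltk_downloader_command(packages, download_dir=None):
    """Build the argv for `python -m nltk.downloader` (hypothetical helper)."""
    # sys.executable keeps the download tied to the interpreter that pip installed nltk into.
    command = [sys.executable, "-m", "nltk.downloader"]
    if download_dir is not None:
        command += ["-d", download_dir]
    command += list(packages)
    return command


def download_nltk_packages(packages, download_dir=None):
    """Run the downloader as a child process (hypothetical helper)."""
    # check=True raises CalledProcessError if the child exits with a non-zero status.
    subprocess.run(nltk_downloader_command(packages, download_dir), check=True)
```

Calling download_nltk_packages(["wordnet"], "/usr/share/nltk_data") would then run the command shown above.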
transcranial
1

As stated in this thread, external data should not be handled by setuptools in setup.py. As an alternative, I suggest adding the following lines to the __init__.py file of your package (assuming you want to download punkt and stopwords):

__version__ = "x.x.x"
__organization__ = "your_organization"  
import nltk 
nltk.download("stopwords") 
nltk.download("punkt")  

This way the files will not be downloaded when the package is installed, but when it is imported (i.e. import my_package).
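One possible refinement, sketched below under some assumptions, is to check whether each resource is already present before calling the downloader, so that later imports do not contact the download server again. The function name ensure_nltk_resources and the find/download parameters are hypothetical, added only so the logic can be exercised without network access. The resource paths ("corpora/stopwords", "tokenizers/punkt") are where I would expect nltk.data.find to locate these packages; verify them for your NLTK version.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download only the NLTK resources that are not installed yet.

    `resources` maps a downloader id (e.g. "stopwords") to the path that
    nltk.data.find is expected to resolve (e.g. "corpora/stopwords").
    Returns the list of ids that had to be downloaded.
    """
    if find is None or download is None:
        # Imported lazily so the function can be tested with stand-in callables.
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for package_id, resource_path in resources.items():
        try:
            find(resource_path)  # raises LookupError when the resource is missing
        except LookupError:
            download(package_id)
            fetched.append(package_id)
    return fetched
```

In __init__.py this could be called as ensure_nltk_resources({"stopwords": "corpora/stopwords", "punkt": "tokenizers/punkt"}) in place of the two unconditional nltk.download calls.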


As an example, here is a Python library that does just this: pyleetspeak.

First you would have to install the library:

pip install -U pyleetspeak

And then importing the library will download the NLTK files:

import pyleetspeak
pyleetspeak.__version__


Álvaro H.G