
I've looked around for a question pertaining to this without any hits, so here we go:

I am working on a toy Python package to deploy to PyPI.org. Part of its job is to streamline parsing text and generating tokenized sentences. Naturally, I considered using nltk for the job, having personally used tools like punkt from that package.

Here's the problem and my question: having looked at the size of nltk and what it needs in order to work, with the corpora nearly 10 gigabytes in size, I've concluded that this is an outlandish burden to put on anyone who wants to use my package, given its use case.

Is there any way to deploy a "pre-trained" instance of punkt? Or can I control the size of the corpora used by nltk?

I am equally open to an alternative package/solution for parsing relatively "sane" human text that comes somewhat close to nltk's performance but without the same disk footprint.

Thanks for any help.


The solution that worked for me, as indicated below by @matisetorm, is:

python -m nltk.downloader punkt
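
For context, a rough sketch of the intended usage once only punkt is installed (the sample text is made up; sent_tokenize relies on the punkt models):

from nltk.tokenize import sent_tokenize  # sentence splitting backed by punkt

# assumes punkt was fetched beforehand with: python -m nltk.downloader punkt
text = "This is a toy example. It should come back as two sentences."
print(sent_tokenize(text))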
zaile
    You can selectively download corpora as described [here](https://stackoverflow.com/questions/5843817/programmatically-install-nltk-corpora-models-i-e-without-the-gui-downloader) or using the [GUI](http://www.nltk.org/data.html) – patrick Feb 21 '18 at 01:21

1 Answer


Absolutely.

1) You can selectively download corpora as described in Programmatically install NLTK corpora / models, i.e. without the GUI downloader? For example,

python -m nltk.downloader <your package you would like to download>
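
For instance, a package could guard the download so the data is fetched only when it is missing; this is a minimal sketch (the punkt resource matches this question, the helper name is illustrative):

import nltk

def ensure_punkt():
    # look for the punkt tokenizer data; download it only if it is not already installed
    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")

ensure_punkt()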

2) Or use the GUI, with instructions at http://www.nltk.org/data.html,

which basically amounts to the following at the command line:

python3
>>> import nltk
>>> nltk.download()
matisetorm
  • Your answer and Patrick's point to the same solution. I'd like to add the knowledge I found in this other [post](https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk) which seems to indicate that `punkt` itself comes with pre-trained models. Limiting myself to `punkt` puts the memory footprint at only ~50 MB, much more reasonable. – zaile Feb 21 '18 at 01:41
  • but of course. :) you should always limit yourself to only those data libraries you need. See option 1. punkt is a pretty awesome resource. There have been times I just download the one I need and put it into my repository directly – matisetorm Feb 21 '18 at 01:47
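
A rough sketch of the "put it into my repository directly" idea from the comment above: fetch punkt once into a folder shipped with the package and point NLTK's search path at it (the my_package/nltk_data path is hypothetical):

import nltk

# one-off: fetch punkt into a directory committed alongside the package (path is hypothetical)
nltk.download("punkt", download_dir="my_package/nltk_data")

# at runtime: make NLTK look in the bundled directory before the default locations
nltk.data.path.insert(0, "my_package/nltk_data")
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
print(tokenizer.tokenize("Bundled data keeps installs small. No full corpora download is needed."))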