3

Here is my code, just performing some tokenization with nltk.

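import nltk
from nltk.corpus import stopwords

# doc is the input text (a str), defined elsewhere
tokens = nltk.word_tokenize(doc, language='english')
# remove all the stopwords and any non-alphanumeric tokens
stop_words = set(stopwords.words('english'))  # build the set once, not once per token
filtered = [w for w in tokens if w not in stop_words and w.isalnum()]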

I've already downloaded the punkt package. I also tried copying the correct folder into the places the error message says it searched. Here is the error, which I have also seen in other similar questions.

Resource u'tokenizers/punkt/english.pickle' not found.
Please use the NLTK Downloader to obtain the resource:
>>> nltk.download()
Searched in:

- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''

I even tried reinstalling nltk and all its packages, but it didn't work. Useful information about the environment:

- run through the terminal of the PyCharm IDE
- operating system: Ubuntu 15
- nltk installed using pip
- nltk_data installed in the default location /home/user/nltk_data

Please don't tell me to use nltk.download('punkt'), because I have already done that. Thanks for your help.

  • Try a system-wide search for the file and check whether it was downloaded to the correct directory: find / | grep punkt/english.pickle – Rafi Sep 01 '16 at 14:48
  • There's another odd issue: it works perfectly when run as a single script, but it raises that exception when called from a bigger piece of software. @Rafi I did, and the file is there. – LEONARDO MIGLIORINI Sep 01 '16 at 16:00
  • Here are a few questions that you have to answer before we can help you. What is the bigger piece of software? How did you run it (through the IDE or in the terminal)? Are you using Windows? Which OS are you using? How did you install NLTK (Anaconda or pip)? Where did you run your Python script? Where did you save the nltk_data directory? – alvas Sep 01 '16 at 20:55
  • In order: the software consists of some Python files that perform basic operations on other files (it works perfectly if I remove the nltk lines); I run it through the terminal of the PyCharm IDE (I cannot use the IDE's plain "Run" because I need root permission); I'm using Ubuntu 15 (same problem on Ubuntu 16); I installed nltk using pip; when I run the script with the Run button (not the terminal), it works; nltk_data is saved in the default location /home/user/nltk_data, and I tried copying it into the locations "suggested" in the error message. Thanks for your interest. – LEONARDO MIGLIORINI Sep 01 '16 at 21:05
  • Please add the information by editing the question instead of putting it in the comment, thank you =) – alvas Sep 02 '16 at 04:14
  • I think you haven't configured the nltk data directory properly; please see http://stackoverflow.com/a/22987374/610569 . Also take a look at `magically_find_nltk_data()` in http://stackoverflow.com/questions/36382937/nltk-doesnt-add-nltk-data-to-search-path/36383314#36383314 – alvas Sep 02 '16 at 04:18
  • Did the solutions in the previous posts solve your problem? If so which? If not can you update your questions with the errors you get after using the solutions? – alvas Sep 02 '16 at 06:02
  • It works after appending the path to nltk.data.path. Thank you very much for your help and your patience. – LEONARDO MIGLIORINI Sep 02 '16 at 07:47
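
The fix reported in the last comment (appending the data directory to nltk.data.path) might look like the sketch below. The helper names add_data_dir and configure_nltk are hypothetical, and /home/user/nltk_data is simply the location given in the question. The error output lists '/root/nltk_data' but not that directory, which would be consistent with the script being run as root.

```python
def add_data_dir(search_path, data_dir):
    """Append data_dir to search_path (a list such as nltk.data.path) if it is missing."""
    if data_dir not in search_path:
        search_path.append(data_dir)
    return search_path


def configure_nltk(data_dir='/home/user/nltk_data'):
    """Make NLTK also look in data_dir (assumed location, taken from the question)."""
    import nltk  # imported lazily so this module loads even where NLTK is absent
    return add_data_dir(nltk.data.path, data_dir)
```

Calling configure_nltk() once, before the first nltk.word_tokenize call, should be enough, since NLTK consults nltk.data.path each time it loads a resource.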

2 Answers

3

You have to install the NLTK punkt tokenizer models before you can tokenize.

  • How?

    1. Open a Terminal.
    2. Run the python command to start the Python interpreter.
    3. Execute import nltk
    4. Execute nltk.download('punkt')
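
The same steps can also be run from a script rather than an interactive session. The sketch below is one possible way to do that, not part of the original answer: ensure_punkt is a hypothetical helper name, and it only downloads when NLTK's own lookup fails.

```python
def ensure_punkt():
    """Download the punkt tokenizer models only if NLTK cannot already find them."""
    import nltk  # imported lazily so this module loads even where NLTK is absent
    try:
        # Raises LookupError when the resource is in none of the nltk.data.path directories.
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt')
```

Note that the download goes to the data directory of whichever user runs the script, so a script run as root and one run as a normal user may end up with separate copies.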

tremendows
0

If you're running this in a distributed environment, you'll have to distribute the NLTK data files to each node. Here's how you would do it in a Spark environment:

 sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
Zcauchon