Azure HDInsight - Resource u'tokenizers/punkt/english.pickle' not found

Question

I imported the nltk package. I need to use nltk.sent_tokenize and nltk.word_tokenize, but whenever I call them I get the following error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 10.0.0.4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<stdin>", line 2, in <lambda>
  File "/usr/bin/anaconda/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 85, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/bin/anaconda/lib/python2.7/site-packages/nltk/data.py", line 781, in load
    opened_resource = _open(resource_url)
  File "/usr/bin/anaconda/lib/python2.7/site-packages/nltk/data.py", line 895, in _open
    return find(path_, path + ['']).open()
  File "/usr/bin/anaconda/lib/python2.7/site-packages/nltk/data.py", line 624, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

I've read many posts on this topic and tried nltk.download('all'), the -d option, and rearranging the sub-folders as well.

Azure ML's Python doesn't come with the nltk library. There should be some other way to use nltk on this platform. Please help! Thank you!

Possible duplicate of [Resource u'tokenizers/punkt/english.pickle' not found](http://stackoverflow.com/questions/26570944/resource-utokenizers-punkt-english-pickle-not-found) — alvas, Apr 19 '16 at 06:59
@alvas - I've already tried everything mentioned in the above link, but none of it works. Can you please suggest an alternative that is specific to solving this on Azure HDInsight? — preetham madeti, Apr 19 '16 at 11:54
Are you sure you've tried everything? Including `import nltk; nltk.download('all')`? — alvas, Apr 19 '16 at 12:18
@alvas Could this be a problem with root user permissions? Please share what you know about this if it's relevant. Thank you! — preetham madeti, Apr 19 '16 at 16:01
Can you show the output of `import nltk; nltk.download('all')`, I believe you but I need some more information to help you. Can you show what you've tried and the outputs, so that we can diagnose the problem? — alvas, Apr 19 '16 at 22:23
Possibly, you need some permissions for the place where the `nltk_data` directory is stored. Can you also try http://stackoverflow.com/questions/36382937/nltk-doesnt-add-nltk-data-to-search-path/36383314#36383314 to find where `nltk_data` is stored? — alvas, Apr 19 '16 at 22:24

score 0 · Answer 1 · answered Apr 20 '16 at 09:24

According to the error information, I think the issue has one of the two causes below, as @alvas said.

The resource does not exist.
The resource's path has not been appended to the path list nltk.data.path.
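
To tell the two causes apart, it can help to check the searched directories directly. The following is a minimal sketch using only the standard library; the helper name `dirs_containing` is hypothetical, the directory list is copied from the "Searched in" section of the traceback, and it assumes the unzipped layout that the NLTK downloader produces.

```python
import os

# Directories copied from the "Searched in" list of the traceback.
SEARCH_PATHS = [
    "/home/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]

# Resource name taken from the LookupError message.
RESOURCE = "tokenizers/punkt/english.pickle"


def dirs_containing(resource, search_paths):
    """Return the search directories in which `resource` exists as a file.

    Hypothetical helper: it only checks the unzipped layout on disk.
    """
    parts = resource.split("/")
    return [d for d in search_paths if os.path.isfile(os.path.join(d, *parts))]


if __name__ == "__main__":
    hits = dirs_containing(RESOURCE, SEARCH_PATHS)
    if hits:
        print("Found in: {0}".format(", ".join(hits)))
    else:
        print("Not found in any of the searched directories")
```

If this prints a directory when run in the shell but the notebook still fails, the file exists and the search path used by the failing process is the more likely culprit.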

Following the comments posted by @alvas, I went through the steps below.

# 1. Install the package `nltk` using `pip`

~ $ pip install nltk

# 2. Download the resource in the Python REPL.

~ $ python
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> punkt
    Downloading package punkt to /home/<username>/nltk_data...
      Unzipping tokenizers/punkt.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q
True
>>>

# 3. Check that the resource path is listed in the path list `nltk.data.path`
>>> nltk.data.path
['/home/<username>/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']

~ $ pwd && ls nltk_data
/home/<username>
tokenizers

# 4. If the dir `nltk_data` is not in the path list `nltk.data.path`,
#    you can try to move the dir to one of the paths in that list,
~ $ mv nltk_data /usr/local/share/

#    or append the path of the dir `nltk_data` to the list variable `nltk.data.path`.
>>> nltk.data.path.append('/home/<username>/nltk_data/')
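
The append in step 4 can be wrapped so that it is safe to repeat. This is only a sketch: `ensure_on_path` is a hypothetical helper, not part of NLTK, and it works on any plain list of paths. Because `nltk.data.path` is an ordinary list inside one Python process, the change affects only the process that runs it. The traceback above passes through `pyspark/worker.py`, so the append may also need to run inside the function that calls the tokenizer, assuming the data directory exists on that machine.

```python
import os


def ensure_on_path(path_list, data_dir):
    """Append data_dir to path_list if it exists and is not already listed.

    Returns True when the directory is on the list afterwards, False when
    the directory does not exist (the list is then left unchanged).
    """
    data_dir = os.path.normpath(data_dir)
    if not os.path.isdir(data_dir):
        return False
    if data_dir not in [os.path.normpath(p) for p in path_list]:
        path_list.append(data_dir)
    return True


# Intended use (assumes nltk is installed; the path is the one from step 3):
#   import nltk
#   ensure_on_path(nltk.data.path, '/home/<username>/nltk_data')
```

A return value of False means the directory is missing for the user and machine running the code, which points back to the first cause.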

Hope it helps.

I did this too :( Step 3 shows exactly what you have there, so the nltk directory and the sub-directory 'tokenizers' were present. NLTK works inside the console. But I am accessing these modules from a Jupyter notebook on the Azure cluster, and that is where nltk doesn't work. Note: I log in to the Jupyter notebook with the username 'admin' and the relevant password, whereas to download any packages onto the cluster (as in the steps above) I log in to the shell with a different username. Could that be the problem? — preetham madeti, Apr 20 '16 at 12:31
@prog-life You can try to move the directory `nltk_data` to the other paths like `/usr/share` or `/usr/local/share` if you have the permission for using `sudo` or `su`. — Peter Pan, Apr 21 '16 at 06:36
