I am trying to use the punkt tokenizer from the NLTK package with pyspark on a Spark standalone cluster. NLTK has been installed on the individual nodes, but the nltk_data folder doesn't reside where NLTK expects it (/usr/share/nltk_data).

I am attempting to use the punkt tokenizer, which is located in /whatever/my_user/nltk_data.

I have set:

import os

envv1 = "/whatever/my_user/nltk_data"   # custom nltk_data location
os.environ['NLTK_DATA'] = envv1         # set before importing nltk

Printing nltk.data.path indicates that the first entry is where my nltk_data folder is actually located.
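For completeness, the check on the driver looks like this (as far as I can tell, NLTK_DATA has to be set before nltk is first imported, since nltk.data builds its search path at import time):

import nltk
print(nltk.data.path)   # '/whatever/my_user/nltk_data' is the first entry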

The import `from nltk import word_tokenize` goes fine, but when it comes to actually calling `word_tokenize()` I get the following error:

ImportError: No module named nltk.tokenize

For whatever reason, I have no trouble accessing resources from nltk.corpus. When I try nltk.download(), it is clear I already have the punkt tokenizer downloaded. I can even use the punkt tokenizer outside of pyspark.
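For reference, the kind of call that triggers the error looks roughly like this (the SparkContext setup and sample data are stand-ins for my actual job):

from pyspark import SparkContext
from nltk import word_tokenize

sc = SparkContext(appName="punkt-repro")
rdd = sc.parallelize(["This is a sentence.", "Here is another one."])

# Fine on the driver, but raises the ImportError once it runs on the workers
print(rdd.map(word_tokenize).collect())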

jamesmf
  • possible duplicate of [How to config nltk data directory from code?](http://stackoverflow.com/questions/3522372/how-to-config-nltk-data-directory-from-code) – alvas Aug 12 '15 at 21:19
  • 1
    Use `nltk.data.path.append("/home/yourusername/whateverpath/")` – alvas Aug 12 '15 at 21:19
  • 1
    I actually can't get the nltk.data.path.append() to work. It looks like it only sets it on the master and all the others nodes dont know where to look. – dreyco676 Mar 11 '16 at 22:21
  • Have you been able to get this to work? – Stefan Falk May 03 '18 at 13:11
  • Unfortunately you will need to send the nltk_data directory to all nodes using, for example, `sc.addFile()` (see the sketch after this list) – VinceP Apr 09 '19 at 11:58
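
Putting the last two comments together, a minimal sketch of the `sc.addFile()` approach (the app name, sample data, and the tokenize_partition helper are placeholders; it assumes the driver can read /whatever/my_user/nltk_data):

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="nltk-punkt")

# Ship the whole nltk_data directory from the driver to every executor
sc.addFile("/whatever/my_user/nltk_data", recursive=True)

def tokenize_partition(lines):
    # Runs on the executors, so the path fix is applied on every node,
    # not just on the master
    import nltk
    nltk.data.path.append(SparkFiles.get("nltk_data"))
    from nltk import word_tokenize
    return (word_tokenize(line) for line in lines)

rdd = sc.parallelize(["This is a sentence.", "Here is another one."])
print(rdd.mapPartitions(tokenize_partition).collect())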
