I am trying to use the punkt tokenizer from the NLTK package with pyspark on a Spark standalone cluster. NLTK has been installed on the individual nodes, but the nltk_data folder doesn't reside where NLTK expects it (/usr/share/nltk_data).
The punkt tokenizer data is instead located in /whatever/my_user/nltk_data.
I have set the NLTK_DATA environment variable before importing NLTK:
import os
envv1 = "/whatever/my_user/nltk_data"
os.environ['NLTK_DATA'] = envv1
Printing nltk.data.path indicates that the first entry is where my nltk_data folder is actually located.
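That is, checking on the driver (a quick sketch of the check):
import nltk
print(nltk.data.path)  # the first entry is /whatever/my_user/nltk_data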
The import itself, from nltk import word_tokenize, goes fine, but when word_tokenize() is actually called I get the following error:
ImportError: No module named nltk.tokenize
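Roughly, the code that produces this looks like the following (a simplified sketch; the RDD contents and app name are placeholders, not my exact job):
import os
os.environ['NLTK_DATA'] = "/whatever/my_user/nltk_data"

from pyspark import SparkContext
from nltk import word_tokenize  # this import succeeds on the driver

sc = SparkContext(appName="tokenize-test")
lines = sc.parallelize(["This is a sentence.", "Here is another one."])

# The ImportError is only raised when word_tokenize actually runs inside a task
tokens = lines.map(word_tokenize).collect()
print(tokens)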
For whatever reason, I have no trouble accessing resources from nltk.corpus. Running nltk.download() confirms that the punkt tokenizer is already downloaded, and I can use it without any problem outside of pyspark.
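For example, something like this runs fine in a plain Python shell on a node (a sketch of the check, using the same data path):
import os
os.environ['NLTK_DATA'] = "/whatever/my_user/nltk_data"

from nltk import word_tokenize
# Works as expected when run outside of a Spark task
print(word_tokenize("The punkt tokenizer works fine outside of pyspark."))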