I would like to use NLTK to do some NLP on Databricks with PySpark. I have installed NLTK from the library tab of Databricks, so it should be accessible from all nodes.
My Python 3 code:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import nltk
nltk.download('punkt')
def get_keywords1(col):
    sentences = []
    sentence = nltk.sent_tokenize(col)
get_keywords_udf = F.udf(get_keywords1, StringType())
I ran the above code and got:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
When I run the following code:
t = spark.createDataFrame(
    [(2010, 1, 'rdc', 'a book'), (2010, 1, 'rdc', 'a car'),
     (2007, 6, 'utw', 'a house'), (2007, 6, 'utw', 'a hotel')],
    ("year", "month", "u_id", "objects"))
t1 = t.withColumn('keywords', get_keywords_udf('objects'))
t1.show() # error here !
I got this error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/databricks/python/nltk_data'
- '/databricks/python/share/nltk_data'
- '/databricks/python/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
I have downloaded 'punkt'. It is located at
/root/nltk_data/tokenizers
I have also updated the PATH in the Spark environment with this folder location.
Why can't it be found?
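One way to narrow this down might be to check where the data actually exists. `nltk.download('punkt')` above ran in the driver process, while the UDF body runs in Python workers on the executors, which may be different machines with their own `/root` directory. The sketch below is a hypothetical diagnostic (the helper name `describe_punkt_location` and its `root` parameter are made up for illustration). It uses only the standard library and reports the host name and whether `tokenizers/punkt` exists under a given root.

```python
import os
import socket


def describe_punkt_location(_value=None, root="/root/nltk_data"):
    """Report which host ran this call and whether punkt exists under root.

    Hypothetical debugging helper; not part of NLTK or PySpark.
    """
    target = os.path.join(root, "tokenizers", "punkt")
    return "{}: {} exists={}".format(
        socket.gethostname(), target, os.path.isdir(target))


# On the driver:
print(describe_punkt_location())

# On the executors (assumes the `t` DataFrame and the F / StringType
# imports from the code above):
# check_udf = F.udf(describe_punkt_location, StringType())
# t.select(check_udf('objects').alias('punkt_check')).distinct().show(truncate=False)
```

If the executor rows report `exists=False` while the driver reports `exists=True`, the data was probably downloaded on the driver only.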
I tried the solutions in "NLTK. Punkt not found" and "How to config nltk data directory from code?", but neither of them works for me.
I have tried updating the search path:
nltk.data.path.append('/root/nltk_data/tokenizers/')
It does not work. It seems that nltk cannot see the newly added path!
I also copied punkt to a path that nltk searches:
cp -r /root/nltk_data/tokenizers/punkt /root/nltk_data
but nltk still cannot see it.
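The traceback hints at why neither attempt helps: the lookup is for the relative name `tokenizers/punkt/PY3/english.pickle` under each entry of the search list, so an entry is expected to be the `nltk_data` root itself, not a folder inside it. The following is a simplified stand-in for that kind of lookup, written with the standard library only. `find_resource` is a hypothetical function, not NLTK's implementation, and the directory layout is recreated in a temporary folder.

```python
import os
import tempfile

# Relative resource name, as shown in the "Attempted to load" line.
RESOURCE = os.path.join("tokenizers", "punkt", "PY3", "english.pickle")


def find_resource(search_roots, resource=RESOURCE):
    """Return the first root that contains `resource`, else None.

    Simplified illustration of a root-plus-relative-name lookup.
    """
    for root in search_roots:
        if os.path.exists(os.path.join(root, resource)):
            return root
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the layout <root>/tokenizers/punkt/PY3/english.pickle
    nltk_data = os.path.join(tmp, "nltk_data")
    pickle_path = os.path.join(nltk_data, RESOURCE)
    os.makedirs(os.path.dirname(pickle_path))
    open(pickle_path, "w").close()

    # Searching from the data root succeeds.
    found_root = find_resource([nltk_data])
    # Searching from the tokenizers subfolder fails: the lookup would need
    # <root>/tokenizers/tokenizers/punkt/... to exist.
    found_subfolder = find_resource([os.path.join(nltk_data, "tokenizers")])

print(found_root is not None, found_subfolder is None)
```

Under this reading, appending `/root/nltk_data/tokenizers/` adds a root that cannot match, and the `cp` command places `punkt` directly under the root, where the lookup does not look. Since `/root/nltk_data` is already in the searched list, the error would suggest that the files are missing on the machine where the worker ran.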
Thanks