I'm struggling to get a script working and am wondering if anyone else has done this successfully. I'm using Glue to run a Spark script and am trying to use the NLTK module to analyze some text. I've been able to import NLTK by uploading it to S3 and referencing that location in the Glue additional Python modules config. However, I'm using the word_tokenize function, which requires the punkt resource to be downloaded into the nltk_data directory.
I've followed this (Download a folder from S3 using Boto3) to copy the punkt files to the tmp directory in Glue. However, when I look in the tmp folder from an interactive Glue session, I don't see the files. When I call word_tokenize, I get an error saying that the package can't be found in the default locations (variations of /usr/nltk_data).
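For reference, here is a minimal sketch of the copy-and-register step described above. It is not a confirmed fix. The bucket name, the `nltk_data/` prefix, the `/tmp/nltk_data` target, and both helper names are hypothetical placeholders. It assumes the S3 prefix mirrors the layout NLTK expects (`tokenizers/punkt/...` directly under the root) and that the download location has to be appended to `nltk.data.path`, since a directory under tmp is not among the default search locations.

```python
import os


def download_s3_prefix(s3_client, bucket, prefix, local_root):
    """Copy every object under s3://bucket/prefix into local_root,
    preserving the key layout relative to the prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    downloaded = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                # Skip zero-byte "folder" placeholder objects.
                continue
            relative = key[len(prefix):].lstrip("/")
            target = os.path.join(local_root, *relative.split("/"))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3_client.download_file(bucket, key, target)
            downloaded.append(target)
    return downloaded


def register_nltk_data(local_root):
    """Add local_root to NLTK's resource search path (hypothetical helper)."""
    import nltk  # imported lazily so the download helper works without NLTK

    if local_root not in nltk.data.path:
        nltk.data.path.append(local_root)


# Hypothetical usage inside the Glue script (names are placeholders):
#   import boto3
#   download_s3_prefix(boto3.client("s3"), "my-bucket", "nltk_data/", "/tmp/nltk_data")
#   register_nltk_data("/tmp/nltk_data")
```

One thing worth checking: this code runs wherever it is called. If word_tokenize is invoked inside a Spark UDF, it executes on the executors, which do not share the driver's local filesystem, so the download and path registration would presumably need to happen there as well.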
I'm going to move the required files into the nltk package in S3 and try to rewrite the NLTK tokenizer to load the files directly instead of from the nltk_data location. But I wanted to check here first whether anyone has been able to get this working, as this seems like a fairly common use case.