
I'm struggling to get a script working and wondering if anyone else has successfully done this. I'm using Glue to execute a Spark script and am trying to use the NLTK module to analyze some text. I've been able to import the NLTK module by uploading it to S3 and referencing that location in the Glue additional Python modules config. However, I'm using the word_tokenize method, which requires the punkt data to be downloaded into an nltk_data directory.
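For context, this is roughly how the module is wired in, via the --additional-python-modules job parameter (the S3 path below is a placeholder, not my actual bucket):

--additional-python-modules s3://my-bucket/modules/nltk-3.6.3-py3-none-any.whl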

I've followed this (Download a folder from S3 using Boto3) to copy the punkt files to the /tmp directory in Glue. However, when I look in the /tmp folder in an interactive Glue session I don't see the files. When I run the word_tokenize method I get an error saying that the package can't be found in any of the default locations (variations of /usr/nltk_data).
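For reference, the copy step I followed looks roughly like this (a sketch, with placeholder bucket and prefix names):

import os
import boto3

# Download every object under the punkt prefix into /tmp/nltk_data
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')  # placeholder bucket name
for obj in bucket.objects.filter(Prefix='nltk_data/tokenizers/punkt/'):  # placeholder prefix
    target = os.path.join('/tmp/nltk_data', os.path.relpath(obj.key, 'nltk_data'))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    bucket.download_file(obj.key, target)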

I'm going to move the required files into the NLTK package in S3 and try to rewrite the NLTK tokenizer to load the files directly instead of from the nltk_data location. But I wanted to check here first whether anyone has gotten this working, since this seems like a fairly common use case.

Zcauchon

2 Answers


I have limited experience with NLTK, but I think nltk.download() will put punkt in the right spot.

import nltk

print('nltk.__version__', nltk.__version__)
nltk.download('punkt')

from nltk import word_tokenize
print(word_tokenize('Glue is good, but it has some rough edges'))

From the logs:

nltk.__version__ 3.6.3
[nltk_data] Downloading package punkt to /home/spark/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
['Glue', 'is', 'good', ',', 'but', 'it', 'has', 'some', 'rough', 'edges']
Bob Haffner
  • Thanks for posting. I was getting an I/O error when I tried using the nltk download directly in Glue. I gave Glue full access to the appropriate S3 buckets, so I wasn't expecting a permissions issue. I'll tweak the permissions and try again. – Zcauchon May 20 '22 at 13:17
  • Ever get it to work? – Bob Haffner May 21 '22 at 16:39

I wanted to follow up here in case anyone else encounters these issues and can't find a working solution.

After leaving this project alone for a while, I finally came back and was able to get a working solution. Initially I was adding my /tmp location to the nltk_data path and downloading the required packages there. However, this wasn't working:

import nltk

nltk.data.path.append("/tmp/nltk_data")
nltk.download("punkt", download_dir="/tmp/nltk_data")
nltk.download("averaged_perceptron_tagger", download_dir="/tmp/nltk_data")

Ultimately, I believe the issue was that the file I needed from punkt was not available on the worker nodes. Using SparkContext's addFile method, I was finally able to use the nltk data:

sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
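On the workers, the distributed file can then be resolved with SparkFiles. This is a sketch of the loading step as I understand it; the exact call is my assumption, not the original script:

# Resolve and load the punkt sentence tokenizer shipped via sc.addFile
from pyspark import SparkFiles
import nltk.data

# SparkFiles.get returns the local path of a file distributed with addFile
tokenizer = nltk.data.load('file://' + SparkFiles.get('english.pickle'))
print(tokenizer.tokenize('Glue is good, but it has some rough edges.'))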

The next issue I had was that I was trying to call a UDF from a .withColumn() method to get the nouns for each row. The issue here is that withColumn requires that a column be passed, but nltk will only work with string values.

Not working:

df2 = df.select(['col1','col2','col3']).filter(df['col2'].isin(date_list)).withColumn('col4', find_nouns(col('col1')))
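For reference, what I was attempting amounts to registering the function as a UDF, roughly like this (a sketch; in hindsight I suspect it failed because the punkt data wasn't available on the workers, not because of withColumn itself):

# Sketch of the UDF route, assuming nltk_data is reachable on the workers
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType
import nltk

def nouns_from_text(text):
    # Tag the tokens and keep the nouns (tags starting with NN)
    is_noun = lambda pos: pos[:2] == 'NN'
    tokens = nltk.word_tokenize(text)
    return [word for (word, pos) in nltk.pos_tag(tokens) if is_noun(pos)]

find_nouns_udf = udf(nouns_from_text, ArrayType(StringType()))

df2 = (df.select(['col1', 'col2', 'col3'])
         .filter(df['col2'].isin(date_list))
         .withColumn('col4', find_nouns_udf(col('col1'))))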

In order to get nltk to work, I passed in my full dataframe and looped over every row, using collect to get the text value of each row, then building a new dataframe and returning it with all the original columns plus the new nltk column. To me this seems incredibly inefficient, but I wasn't able to get a working solution without it.

df2 = find_nouns(df)

from pyspark.sql.types import StructType
import nltk

def find_nouns(df):
    data = []
    schema = StructType([...])  # the original columns plus the new nouns column
    is_noun = lambda pos: pos[:2] == 'NN'
    # Collect the dataframe to the driver once, rather than once per row
    for row in df.collect():
        tokenized = nltk.word_tokenize(row[0])
        nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
        data.append((row[0], row[1], row[2], nouns))
    return spark.createDataFrame(data=data, schema=schema)

I'm sure there's a better solution out there, but I hope this can help someone get their project to an initial working solution.

Zcauchon