Lambda not supporting NLTK file size

Question

I am writing a Python script that analyzes a piece of text and returns the data in JSON format. I am using NLTK to analyze the data. Basically, this is my flow:

Create an endpoint (API Gateway) -> calls my Lambda function -> returns JSON of the required data.

I wrote my script and deployed it to Lambda, but I ran into this issue:

Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt')

Searched in:
- '/home/sbx_user1058/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/var/lang/nltk_data'
- '/var/lang/lib/nltk_data'

Even after downloading 'punkt', my script still gave me the same error. I tried the solutions here:

Optimizing python script extracting and processing large data files

but the issue is that the nltk_data folder is huge, while Lambda has a size restriction.

How can I fix this issue? Or where else can I host my script and still integrate the API call?

I am using Serverless to deploy my Python scripts.

1.4 GB; that's due to the NLTK library and the Stanford library. Any ideas on how or where I can host the code? — noor, Oct 24 '17 at 01:55
You don't need to download all of NLTK. If you just need `punkt`, why not download just that? — Tarun Lalwani, Oct 24 '17 at 08:45

0bserver07 · Accepted Answer · 2017-10-30T16:54:09.607

There are two things that you can do:

The error suggests that the data path is not being defined properly. Maybe add the directory to NLTK's search path in code (`nltk.data.path` is the list of directories NLTK searches for resources):

import nltk
nltk.data.path.append('/var/task/nltk_data/')

or set it as an environment variable, this way:

After running nltk.download(), copy the downloaded data to the root folder of your AWS Lambda application. (The directory must be named "nltk_data".)
In the Lambda function dashboard (in the AWS console), add NLTK_DATA=./nltk_data as a key-value environment variable.
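As a rough illustration of the path setup, here is a minimal sketch that assumes the nltk_data directory is packaged next to the handler file. The names `bundled_nltk_data_dir`, `configure_nltk_data` and `handler` are hypothetical, and the event shape (a plain dict with a "text" key) is an assumption, not something taken from the question.

```python
import os


def bundled_nltk_data_dir(base_dir=None):
    """Return the absolute path of the nltk_data directory shipped with the code."""
    if base_dir is None:
        # In Lambda the deployment package is unpacked at LAMBDA_TASK_ROOT
        # (normally /var/task); elsewhere fall back to this file's directory.
        base_dir = os.environ.get("LAMBDA_TASK_ROOT") or os.path.dirname(
            os.path.abspath(__file__)
        )
    return os.path.join(base_dir, "nltk_data")


def configure_nltk_data(data_dir):
    """Put data_dir at the front of NLTK's resource search path."""
    import nltk  # imported lazily so the helper above stays stdlib-only

    if data_dir not in nltk.data.path:
        nltk.data.path.insert(0, data_dir)
    return nltk.data.path


def handler(event, context):
    """Hypothetical Lambda entry point: tokenize the text in the event."""
    configure_nltk_data(bundled_nltk_data_dir())
    from nltk.tokenize import word_tokenize

    # Assumption: the caller passes the text under the "text" key.
    return {"tokens": word_tokenize(event["text"])}
```

Resolving the directory from LAMBDA_TASK_ROOT avoids depending on the process's working directory, which a relative value such as ./nltk_data does.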

Reduce the size of the NLTK downloads, since you won't need all of them.
1. Delete all the zip files and keep only the sections you need. For example, if you need stopwords, keep nltk_data/corpora/stopwords and delete the rest.
2. Or, if you need tokenizers, keep nltk_data/tokenizers/punkt. Most of these can be downloaded separately: run python -m nltk.downloader punkt, then copy over the files.
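The pruning steps above could be scripted before packaging. This is a minimal sketch using only the standard library; `prune_nltk_data` is a hypothetical helper, and the paths in `keep` are relative to the nltk_data directory. It deletes files permanently, so run it on a copy.

```python
import shutil
from pathlib import Path


def prune_nltk_data(data_dir, keep):
    """Delete everything under data_dir except the relative paths in keep.

    Returns the total size in bytes of the files that remain.
    """
    root = Path(data_dir)
    keep_paths = [root / k for k in keep]

    def is_needed(path):
        # Keep a path if it is a kept entry, lies inside one, or is a parent of one.
        return any(
            path == k or k in path.parents or path in k.parents
            for k in keep_paths
        )

    # Visit the deepest paths first so children are handled before their parents.
    entries = sorted(root.rglob("*"), key=lambda p: len(p.parts), reverse=True)
    for path in entries:
        if not path.exists() or is_needed(path):
            continue
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()  # also removes leftover .zip archives

    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
```

For example, `prune_nltk_data("nltk_data", ["tokenizers/punkt"])` would leave only the punkt tokenizer files, and the returned size can be compared against the Lambda package limit before deploying.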

How do you copy a file to the root folder of the AWS lambda application? Is this actually possible? — jimiclapton, Jan 15 '21 at 00:07
