16

I'm encountering a difficulty when using NLTK corpora (in particular stop words) in AWS Lambda. I know the corpora need to be downloaded; I have done so with nltk.download('stopwords') and included them in the zip file used to upload the Lambda modules, at nltk_data/corpora/stopwords.

The usage in the code is as follows:

from nltk.corpus import stopwords
stopwords = stopwords.words('english')
nltk.data.path.append("/nltk_data")

This returns the following error from the Lambda log output

module initialization error: 
**********************************************************************
  Resource u'corpora/stopwords' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/sbx_user1062/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data'
**********************************************************************

I have also tried to load the data directly by including

nltk.data.load("/nltk_data/corpora/stopwords/english")

Which yields a different error below

module initialization error: Could not determine format for file:///stopwords/english based on its file
extension; use the "format" argument to specify the format explicitly.

It's possible that it has a problem loading the data from the Lambda zip and needs it stored externally, say on S3, but that seems a bit strange.

Any idea what the "format" argument should be in this case?

Does anyone know where I could be going wrong?

Praxis
  • try `stopwords = nltk.corpus.stopwords.words('english')`. From the error output it looks like it searches the `nltk_data` folder for corpora.stopwords, but the intervening / is missing, so it might just be a directory path issue. Not 100% sure this will work, because I cannot see your system or the files, but it otherwise looks OK – sconfluentus Feb 22 '17 at 04:46
  • Use the full path, e.g. `/home/sbx_user1062/nltk_data` and try: http://stackoverflow.com/a/22987374/610569 – alvas Feb 22 '17 at 08:12
  • If nothing works, see `magically_find_nltk_data()` from http://stackoverflow.com/questions/36382937/nltk-doesnt-add-nltk-data-to-search-path/36383314#36383314 – alvas Feb 22 '17 at 08:13
  • Thanks, I will try those suggestions and report back. One problem is that the user name, e.g. 'sbx_user1062', is different every time the AWS Lambda script is run, which may mean that I need to locate the files at a static source on S3 unless I can find another way to specify the execution directory. – Praxis Feb 22 '17 at 08:42
  • Move the directory into a static asset and fix the `nltk_data` directory. A simple AWS Lambda service might not be sufficient; you would need something like "AWS Simple Storage" (S3). – alvas Feb 22 '17 at 08:47
  • I'm not sure how that'll work, but setting up a "serverless" system without storage won't exactly work when most machine-learning / NLP applications require model/data loading. Try a REST API with a DigitalOcean droplet instead. – alvas Feb 22 '17 at 08:50
  • The 'serverless' part is just using NLTK for tokenizing words and loading them to an RDS instance for later analysis. It mostly works fine as the data is loaded into a StringIO object in memory before RDS storage. Lambda has worked nicely up until now so hopefully the NLTK library can be served from a static source. – Praxis Feb 22 '17 at 09:07
  • After trying all sorts of path configurations with not much progress, I have redefined the question and posted it here http://stackoverflow.com/questions/42394335/paths-in-aws-lambda-with-python-nltk – Praxis Feb 22 '17 at 14:32

4 Answers

28

Another solution is to use Lambda's ephemeral storage at the location /tmp.

So, you would have something like this:

import nltk

# /tmp is the only writable path in the Lambda execution environment
nltk.data.path.append("/tmp")
nltk.download("punkt", download_dir="/tmp")

At runtime, punkt will be downloaded to the /tmp directory, which is writable. However, this likely isn't a great solution if you have huge concurrency, since every cold start repeats the download; a guard that skips the download when the data is already present is sketched below.
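A minimal sketch of that guard, assuming punkt is the resource you need; `nltk.data.find()` raises a `LookupError` when a resource is not on the search path:

import nltk

nltk.data.path.append("/tmp")

try:
    # reuse data left behind by a previous warm invocation
    nltk.data.find("tokenizers/punkt")
except LookupError:
    # cold start: fetch into the writable /tmp directory
    nltk.download("punkt", download_dir="/tmp")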

machin
Anonymous Juan
  • Thanks, it worked. This should have been the accepted answer. – Biranchi Oct 08 '19 at 09:33
  • It's a great solution. Why? 1. It does the job. 2. Lambda has a size restriction on the function and its packages, even if loaded from S3. This approach lets you avoid including the required NLTK packages in the function package you upload; instead, the packages are installed on the AWS platform. – Arash Mar 13 '20 at 01:31
  • It worked for me, thanks. I was looking for corpora, so I used nltk.download("popular", "/tmp") – Dileep Apr 11 '20 at 21:46
  • Worked for me, though I was also surprised to discover Lambda's /tmp storage is not exactly ephemeral: undeleted files may build up, and indeed some people use it as an unreliable cache. – James Creasy Jan 07 '22 at 18:47
  • While a simple solution, this introduces potential hazards down the road. First, you're executing that download on each cold start, which a) slows down your cold start and b) could be a problem with high concurrency, as the answer itself points out. It also means your lambda can break unexpectedly if the content of the download changes. – Dausuul Aug 10 '22 at 17:38
16

I had the same problem before but I solved it using the environment variable.

  1. Execute "nltk.download()" and copy the downloaded data to the root folder of your AWS Lambda application. (The folder should be called "nltk_data".)

You can use the following code for that:

import nltk
# download punkt into an nltk_data folder in the project root
nltk.download('punkt', download_dir='nltk_data/')

This will download 'punkt' to your root directory. Then add the line below to your Dockerfile:

COPY nltk_data ./nltk_data
  2. In the user interface of your Lambda function (in the AWS console), add the environment variable "NLTK_DATA" = "./nltk_data" (see the screenshot "Configure NLTK DATA for AWS Lambda").
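With NLTK_DATA pointing at the bundled folder, a minimal handler sketch (my illustration, not part of the answer; it assumes both 'punkt' and 'stopwords' were downloaded into nltk_data in step 1, and the event shape {"text": ...} is hypothetical):

import nltk
from nltk.corpus import stopwords

# NLTK picks up the NLTK_DATA environment variable at import time,
# so ./nltk_data is already on the search path here
STOPWORDS = set(stopwords.words("english"))

def lambda_handler(event, context):
    # hypothetical event shape: {"text": "..."}
    tokens = nltk.word_tokenize(event["text"])
    return [t for t in tokens if t.lower() not in STOPWORDS]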
Haroon
jonathan_007
  • This did not work for me. I added an environment property to my Elastic Beanstalk env; it does add the path to the list of searched directories, but nothing is found. Any additional steps needed to get it working? – leoschet Sep 03 '18 at 14:19
  • Doesn't work in my case for an Amazon Lambda function. Is there anything else that needs to be done to get it working? – webdevbyjoss Apr 24 '19 at 21:52
  • This approach worked for me. My use case: 1. started a venv, 2. pip installed zappa, 3. built a helper module that included nltk, 4. `zappa deploy dev` was breaking due to nltk data not found (similar error logs showed /sbx_user1055/nltk_data missing), 5. found this answer and moved NLTK_DATA to my zappa dir, 6. added the ENV variable in the deployed Lambda function, 7. works like magic now. Thanks @jonathan_007! – MorningHacker Jun 05 '20 at 22:01
2

On AWS Lambda you need to include the nltk Python package with your Lambda function and modify the package's data.py:

path += [
    str('/usr/share/nltk_data'),
    str('/usr/local/share/nltk_data'),
    str('/usr/lib/nltk_data'),
    str('/usr/local/lib/nltk_data')
]

to

path += [
    str('/var/task/nltk_data')
    #str('/usr/share/nltk_data'),
    #str('/usr/local/share/nltk_data'),
    #str('/usr/lib/nltk_data'),
    #str('/usr/local/lib/nltk_data')
]

You can't include the entire nltk_data directory, so delete all the zip files; if you only need stopwords, keep nltk_data -> corpora -> stopwords and dump the rest. If you need tokenizers, keep nltk_data -> tokenizers -> punkt. To download the nltk_data folder, use an Anaconda Jupyter notebook and run

nltk.download()

or

https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip

or

python -m nltk.downloader all
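As an aside, a less invasive variant of the same idea (my sketch, not part of the answer) is to leave data.py untouched and prepend the bundled path at runtime; /var/task is where Lambda unpacks the deployment package:

import nltk

# point NLTK at the data bundled in the deployment package,
# without patching the library's data.py
nltk.data.path.insert(0, "/var/task/nltk_data")

from nltk.corpus import stopwords
print(stopwords.words("english")[:5])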
  • Where is the data.py that needs to be modified? – Pat Needham Feb 09 '18 at 19:25
  • It's in the nltk package; the nltk directory needs to be in the root of the function itself. This is separate from the nltk_data directory. If you use virtualenv it will be in (env)/lib/(python version)/site-packages/nltk. In your lambda_function.py (generally the default .py filename) add: import nltk – Phillip Viau Mar 26 '18 at 18:09
  • Where does the nltk_data directory go? Do I put it under the nltk package? – iCHAIT Dec 06 '19 at 05:11
1

If your stopwords corpus is under /nltk_data (relative to the root, not under your home directory), you need to tell nltk where to look before you try to access a corpus:

from nltk.corpus import stopwords
nltk.data.path.append("/nltk_data")

stopwords = stopwords.words('english')
alexis
  • I think the OP's problem is deeper than it seems. Serverless systems assume that everything can be done in code, with minimal external resources (data/models) landing on disk. – alvas Feb 23 '17 at 01:00
  • This code should throw `NameError: name 'nltk' is not defined`. Please correct me if I'm wrong. – Dr. House Feb 26 '20 at 22:05
  • If that's the only code you run, sure. But (a) this is based on the OP's similarly incomplete snippet, and (b) it's just showing the order of the relevant statements, not a complete MWE. – alexis Feb 27 '20 at 14:56
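For completeness, a self-contained version of this snippet (my consolidation of the thread, combining the path fix above with the /tmp fallback from the top answer):

import nltk

# look in both the bundled location and Lambda's writable /tmp
nltk.data.path.append("/nltk_data")
nltk.data.path.append("/tmp")

try:
    nltk.data.find("corpora/stopwords")
except LookupError:
    # fall back to downloading into /tmp on a cold start
    nltk.download("stopwords", download_dir="/tmp")

from nltk.corpus import stopwords
english_stopwords = stopwords.words("english")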