
I have a script that uses NLTK and runs on AWS Lambda. I use a pipeline to automate all the deployment steps: when a new commit lands in the GitHub repository, AWS CodeBuild builds the project and deploys it to my Lambda function.

The script

  • Environment: Python 3.6.5
  • Uses NLTK with the stopwords and wordnet packages

I based my buildspec on this solution: Installing NLTK/WORDNET on AWS Lambda via CodeBuild

version: 0.2
phases:
 install:
   commands:
     - echo "install step"
     - apt-get update
     - apt-get install zip -y
     - apt-get install python3-pip -y
     - pip install --upgrade pip
     - pip install --upgrade awscli
     # Define directories
     - export HOME_DIR=`pwd`
     - export NLTK_DATA=$HOME_DIR/nltk_data
 pre_build:
   commands:
     - echo "pre_build step"
     - cd $HOME_DIR
     - virtualenv venv
     - . venv/bin/activate
     # Install modules
     - pip install -U requests
     # NLTK download
     - pip install -U nltk
     - python -m nltk.downloader -d $NLTK_DATA wordnet stopwords
     - pip freeze > requirements.txt
 build:
   commands:
     - echo 'build step'
     - cd $HOME_DIR
     - mv $VIRTUAL_ENV/lib/python3.6/site-packages/* .
     - sudo zip -r9 algo.zip .
     - aws s3 cp --recursive --acl public-read ./ s3://hilightalgo/
     # Put the zip on the lambda function
     - aws lambda update-function-code --function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight --zip-file fileb://algo.zip
 post_build:
   commands:
     - echo "Build: end"

All the build steps complete without errors, but when I invoke my Lambda function, it looks like the NLTK data is missing. Here is the result of the Lambda execution:

{"errorMessage":"\n**********************************************************************\n Resource \u001b[93mstopwords\u001b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \u001b[31m>>> import nltk\n >>> nltk.download('stopwords')\n \u001b[0m\n Attempted to load \u001b[93mcorpora/stopwords\u001b[0m\n\n Searched in:\n - '/home/sbx_user1060/nltk_data'\n - '/var/lang/nltk_data'\n - '/var/lang/share/nltk_data'\n - '/var/lang/lib/nltk_data'\n - '/usr/share/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/lib/nltk_data'\n - '/usr/local/lib/nltk_data'\n**********************************************************************\n","errorType":"LookupError","stackTrace":[" File \"/var/task/lambda_function.py\", line 13, in lambda_handler\n return preprocessing.find_sentences('twitter.txt', 'english')\n"," File \"./hilight_aglo_v2/preprocessing.py\", line 100, in find_sentences\n (data, data_stopwords) = sentence_tokenize(file, language)\n"," File \"./hilight_aglo_v2/preprocessing.py\", line 52, in sentence_tokenize\n stop_words = set(stopwords.words(language))\n"," File \"/var/task/nltk/corpus/util.py\", line 123, in __getattr__\n self.__load()\n"," File \"/var/task/nltk/corpus/util.py\", line 88, in __load\n raise e\n"," File \"/var/task/nltk/corpus/util.py\", line 83, in __load\n root = nltk.data.find('{}/{}'.format(self.subdir, self.__name))\n"," File \"/var/task/nltk/data.py\", line 699, in find\n raise LookupError(resource_not_found)\n"]}

I don't know why Lambda can't find the NLTK data. Does anyone know how to solve this?

Louis Singer

2 Answers


According to the error message, NLTK searches in these directories for the corpora:

Searched in:
 - '/home/sbx_user1060/nltk_data'
 - '/var/lang/nltk_data'
 - '/var/lang/share/nltk_data'
 - '/var/lang/lib/nltk_data'
 - '/usr/share/nltk_data'
 - '/usr/local/share/nltk_data'
 - '/usr/lib/nltk_data'
 - '/usr/local/lib/nltk_data'

However, file system access in the Lambda execution environment is restricted: these directories might not even exist, let alone be readable by your code. Furthermore, your deployment package (the .zip archive you create) is extracted to /var/task, which is where your code runs from.
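
If you want to verify this from inside the function, you could log which of the searched directories actually exist. This is only a minimal sketch: the helper name `existing_dirs` is made up for illustration, and the list is copied from the error message (in a real handler you would more likely pass `nltk.data.path`).

```python
import os


def existing_dirs(candidates):
    """Return the candidate paths that exist as directories, in order."""
    return [path for path in candidates if os.path.isdir(path)]


if __name__ == "__main__":
    # Some of the paths from the error message, plus the location where
    # the bundled data is expected to end up (an assumption about the zip).
    searched = [
        "/var/lang/nltk_data",
        "/usr/share/nltk_data",
        "/usr/local/share/nltk_data",
        "/var/task/nltk_data",
    ]
    print(existing_dirs(searched))
```

Logging the result once at the top of the handler should be enough to confirm where the data really is.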

Luckily, you can tell NLTK where to look for the corpora by setting the NLTK_DATA environment variable. If I understand your build process correctly, you bundle the NLTK corpora into a subdirectory nltk_data, next to your Python code and the required libraries. So in the Lambda execution environment, they will be found at /var/task/nltk_data.

Hence, try setting the NLTK_DATA environment variable for your function at the end of your CodeBuild process:

aws lambda update-function-configuration \
--function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight \
--environment 'Variables={NLTK_DATA=/var/task/nltk_data}'
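
If you would rather not touch the function configuration, an alternative is to add the bundled directory to NLTK's search path in code, before the first corpus is loaded. The following is a rough sketch rather than a tested deployment: the helper names are hypothetical, and it assumes the data sits in nltk_data at the root of the zip and that Lambda exposes the code directory through the LAMBDA_TASK_ROOT variable (falling back to /var/task otherwise).

```python
import os
import posixpath


def bundled_nltk_data_dir(task_root=None):
    """Return the directory where nltk_data lands if it is zipped at the root."""
    # Lambda runs on Linux, so POSIX path joining is used on purpose.
    root = task_root or os.environ.get("LAMBDA_TASK_ROOT", "/var/task")
    return posixpath.join(root, "nltk_data")


def register_data_dir(search_path, data_dir):
    """Put data_dir at the front of a search path list, without duplicates."""
    if data_dir not in search_path:
        search_path.insert(0, data_dir)
    return search_path


# NLTK is only importable where it is installed; the guard lets this
# sketch run anywhere, and it does nothing when NLTK is missing.
try:
    import nltk.data
except ImportError:
    nltk = None

if nltk is not None:
    # nltk.data.path is the list NLTK walks when resolving resources.
    register_data_dir(nltk.data.path, bundled_nltk_data_dir())
```

This would go at module level in lambda_function.py, above any code that touches `stopwords` or `wordnet`.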
Milan Cermak
  • Thank you it works! Note that it is necessary to modify the IAM Role of CodeBuild to use the command `update-function-configuration`. – Louis Singer Feb 03 '19 at 22:06
  • Having the same issue for cmudict corpora, but this answer is great. @Milan Cermak, do you know why I might now be seeing the error `[Errno 30] Read-only file system: '/var/task/nltk_data/corpora` ? – jimiclapton Jan 16 '21 at 17:32

In the Lambda console, open your function and go to Configuration > Environment variables, then add the variable NLTK_DATA with the value /var/task/nltk_data.
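
The same setting can also be applied from a script. As far as I know, the environment passed to an update-function-configuration call replaces the function's whole variable set, so the sketch below reads the current variables first and merges the new one in. The function name `set_function_env_var` is hypothetical, and the client is passed in as a parameter so the example does not depend on boto3 being installed.

```python
def set_function_env_var(client, function_name, key, value):
    """Set one environment variable on a Lambda function, keeping the others.

    `client` is expected to behave like boto3.client("lambda").
    """
    config = client.get_function_configuration(FunctionName=function_name)
    # "Environment" is absent when the function has no variables yet.
    variables = dict(config.get("Environment", {}).get("Variables", {}))
    variables[key] = value
    client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables


# Example usage (assumes boto3 is installed and credentials are configured):
#
#   import boto3
#   client = boto3.client("lambda", region_name="eu-west-3")
#   set_function_env_var(client, "LaunchHilight",
#                        "NLTK_DATA", "/var/task/nltk_data")
```

The role running the script needs permission for both the get and the update call.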
