
I am building a Docker container using the following Dockerfile:

FROM ubuntu:14.04

RUN apt-get update

RUN apt-get install -y python python-dev python-pip

ADD . /app

RUN apt-get install -y python-scipy

RUN pip install -r /arrc/requirements.txt

EXPOSE 5000

WORKDIR /app

CMD python app.py

Everything goes well until I run the image and get the following error:

**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

I have had this problem before and it is discussed here; however, I am not sure how to approach it using Docker. I have tried:

CMD python
CMD import nltk
CMD nltk.download()

as well as:

CMD python -m nltk.downloader -d /usr/share/nltk_data popular

but I am still getting the error.

GNMO11
  • This is wrong: "CMD python CMD import nltk CMD nltk.download()". It is the same as opening a terminal and typing `python`, then opening another terminal and typing `import nltk`, and so on (of course the second command will fail, as it is not run inside Python). – user2915097 Jun 30 '15 at 16:21
  • Maybe `RUN python -c 'import nltk ; nltk.download()'` or something like that (I am not sure of the syntax). – user2915097 Jun 30 '15 at 16:23

5 Answers


In your Dockerfile, try adding this instead:

RUN python -m nltk.downloader punkt

This runs the command at build time and installs the requested files to /root/nltk_data/ (the default download location when building as root), which is one of the directories NLTK searches.

The problem is most likely related to using CMD vs. RUN in the Dockerfile. Documentation for CMD:

The main purpose of a CMD is to provide defaults for an executing container.

which is used during docker run <image>, not during the build. Only the last CMD in a Dockerfile takes effect, so your other CMD lines were overridden by the final CMD python app.py line.
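
For example, here is a sketch of the question's Dockerfile with the download moved into a RUN step (it assumes nltk is listed in requirements.txt and that requirements.txt ends up under /app via the ADD):

FROM ubuntu:14.04

RUN apt-get update
RUN apt-get install -y python python-dev python-pip python-scipy

ADD . /app

# assumes requirements.txt (which includes nltk) was copied to /app by the ADD above
RUN pip install -r /app/requirements.txt

# fetch the punkt tokenizer data at build time, not at container start
RUN python -m nltk.downloader punkt

EXPOSE 5000
WORKDIR /app

CMD python app.py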

cchi
  • Do you know, if I just COPY the nltk_data folder, whether I need to copy both the uncompressed folders and the zip files, or only the zip files? – perrohunter Aug 01 '17 at 21:57
  • I am using this approach but I am getting this error: /usr/local/lib/python3.6/runpy.py:125: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour – xzegga Feb 01 '18 at 21:37
  • Here's more info from NLTK to pair with the command above. https://www.nltk.org/data.html#command-line-installation – Chris Farr Jun 29 '18 at 16:04
  • I also got the warning when building the Docker container, so I instead used `RUN python -c "import nltk; nltk.download('punkt')"`, which builds without the warning – Johann May 07 '20 at 07:33

Well, I tried all the suggested methods but nothing worked, so I realized that the nltk module searches in /root/nltk_data.

Step 1: I downloaded punkt on my machine by using

python3
>>> import nltk
>>> nltk.download('punkt')

And punkt ended up in /root/nltk_data/tokenizers.

Step 2: I copied the tokenizers folder into my project directory, which then looked something like this:

.
|-app/
|-tokenizers/
|--punkt/
|---all those pkl files
|--punkt.zip

Step 3: Then I modified the Dockerfile to copy that folder into the Docker image:

COPY ./tokenizers /root/nltk_data/tokenizers

Step 4: The new image had punkt available.
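
If it is unclear where NLTK is looking inside your container, a quick check (a sketch, assuming nltk is already installed in the image) is to print the search path during the build:

RUN python3 -c "import nltk; print(nltk.data.path)"

The COPY destination just has to match one of the printed directories, which is /root/nltk_data here.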

idris
  • This deserves more upvotes. It was the simplest solution to the problem and the only one that worked for me. My Dockerfile line: ```COPY ./nltk_data /usr/local/nltk_data``` – GDB Nov 08 '20 at 02:20

I was facing the same issue when creating a Docker image from an Ubuntu base image with Python 3 for a Django application.

I resolved it as shown below.

# start from an official image
FROM ubuntu:16.04

RUN apt-get update \
  && apt-get install -y python3-pip python3-dev \
  && apt-get install -y libmysqlclient-dev python3-virtualenv

# arbitrary location choice: you can change the directory
RUN mkdir -p /opt/services/djangoapp/src
WORKDIR /opt/services/djangoapp/src

# copy our project code
COPY . /opt/services/djangoapp/src

# install dependency for running service
RUN pip3 install -r requirements.txt
RUN python3 -m nltk.downloader punkt
RUN python3 -m nltk.downloader wordnet

# Setup supervisord
RUN mkdir -p /var/log/supervisor
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf

# Start processes
CMD ["/usr/bin/supervisord"]
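
If you want the build to fail early when a resource did not download correctly, a check can be added after the downloader steps (a sketch, not part of the original answer; nltk.data.find raises LookupError when a resource is missing):

RUN python3 -c "import nltk; nltk.data.find('tokenizers/punkt'); nltk.data.find('corpora/wordnet')"
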
Shree Prakash

I got this to work for Google Cloud Build by indicating a download destination within the container.

RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ]

Full Dockerfile

FROM python:3.8.3

WORKDIR /app

ADD . /app

# install requirements
RUN pip3 install --upgrade pip
RUN pip3 install --no-cache-dir --compile -r requirements.txt

RUN [ "python3", "-c", "import nltk; nltk.download('punkt', download_dir='/usr/local/nltk_data')" ]

CMD exec uvicorn --host 0.0.0.0 --port $PORT main:app
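
Side note (an observation, not something stated in the answer): /usr/local/nltk_data is picked up here most likely because recent NLTK versions also search sys.prefix/nltk_data, and sys.prefix is /usr/local in the official python images. If you choose a directory that is not on NLTK's default search path, you can point NLTK at it explicitly with the NLTK_DATA environment variable, e.g.:

ENV NLTK_DATA=/usr/local/nltk_data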
E G

Currently I had to do this (see the RUN cp -r /root/nltk_data /usr/local/share/nltk_data line below): the downloads run as root and end up in /root/nltk_data, which the non-root app user cannot read, so the data is copied to a system-wide directory that NLTK also searches.

FROM ubuntu:latest
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get clean && apt-get update && apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    sudo \
    python3 \
    build-essential \
    python3-pip \
    python3-setuptools \
    python3-dev \
    && rm -rf /var/lib/apt/lists/*
 
RUN pip3 install --upgrade pip

ENV PYTHONPATH "${PYTHONPATH}:/app"

ADD requirements.txt .
# in requirements.txt: pandas, numpy, wordcloud, matplotlib, nltk, sklearn

RUN pip3 install -r requirements.txt 
RUN [ "python3", "-c", "import nltk; nltk.download('stopwords')" ]
RUN [ "python3", "-c", "import nltk; nltk.download('punkt')" ]
RUN cp -r /root/nltk_data /usr/local/share/nltk_data 

RUN addgroup --system app \
    && adduser --system --ingroup app app

WORKDIR /home/app
ADD inputfile .
ADD script.py . 
# the script uses the python modules: pandas, numpy, wordcloud, matplotlib, nltk, sklearn

RUN chown app:app -R /home/app
USER app
RUN python3 script.py inputfile outputfile
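
An alternative to the cp step (a sketch, not what this answer does) is to pass download_dir so the data goes straight to a system-wide directory that NLTK already searches, and nothing lands under /root at all:

RUN [ "python3", "-c", "import nltk; nltk.download('stopwords', download_dir='/usr/local/share/nltk_data'); nltk.download('punkt', download_dir='/usr/local/share/nltk_data')" ]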
Ferroao