
I have to run a Python program on Red Hat 8, so I pulled the Red Hat Docker image and wrote the following Dockerfile:

  FROM redhat/ubi8:latest
  RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf && mkdir /home/spark && mkdir /home/spark/spark && mkdir /home/spark/ETL && mkdir /usr/lib/java && mkdir /usr/share/oracle

  # set environment vars
  ENV SPARK_HOME /home/spark/spark
  ENV JAVA_HOME /usr/lib/java

  # install packages
  RUN \
    echo "nameserver 9.9.9.9" >> /etc/resolv.conf && \
    yum install -y rsync && yum install -y wget && yum install -y python3-pip && \
    yum install -y openssh-server && yum install -y openssh-clients && \
    yum install -y unzip && yum install -y python38 && yum install -y nano

  # create ssh keys
  RUN \
    echo "nameserver 9.9.9.9" >> /etc/resolv.conf && \
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
    chmod 0600 ~/.ssh/authorized_keys

  # copy ssh config
  COPY ssh_config /root/.ssh/config
  COPY spark-3.1.2-bin-hadoop3.2.tgz /home/
  COPY jdk-8u25-linux-x64.tar.gz /home/
  COPY instantclient-basic-linux.x64-19.8.0.0.0dbru.zip /home
  COPY etl /home/ETL/

  RUN \
    tar -zxvf /home/spark-3.1.2-bin-hadoop3.2.tgz -C /home/spark && \
    mv -v /home/spark/spark-3.1.2-bin-hadoop3.2/* $SPARK_HOME && \
    tar -zxvf /home/jdk-8u25-linux-x64.tar.gz -C /home/spark && \
    mv -v /home/spark/jdk1.8.0_25/* $JAVA_HOME && \
    unzip /home/instantclient-basic-linux.x64-19.8.0.0.0dbru.zip -d /home/spark && \
    mv -v /home/spark/instantclient_19_8 /usr/share/oracle && \
    echo "export JAVA_HOME=$JAVA_HOME" >> ~/.bashrc && \
    echo "export PATH=$PATH:$JAVA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:/usr/share/oracle/instantclient_19_8" >> ~/.bashrc && \
    echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/share/oracle/instantclient_19_8" >> ~/.bashrc && \
    echo "PYTHONPATH = $PYTHONPATH:/usr/bin/python3.8" >> ~/.bashrc && \
    echo "alias python=/usr/bin/python3.8" >> ~/.bashrc

  #WARNING: Running pip install with root privileges is generally not a good idea. Try `python3.8 -m pip install --user` instead.
  # so I have to create a user
  RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf 
  RUN useradd -d /home/spark/myuser myuser
  USER myuser
  WORKDIR /home/spark/myuser
  ENV PATH="/home/spark/myuser/.local/bin:$PATH"
  RUN \  
  python3.8 -m pip install --user pandas && \ 
  python3.8 -m pip install --user cx-Oracle && \
  python3.8 -m pip install --user persiantools && \
  python3.8 -m pip install --user pyspark && \
  python3.8 -m pip install --user py4j && \
  python3.8 -m pip install --user python-dateutil && \
  python3.8 -m pip install --user pytz && \
  python3.8 -m pip install --user setuptools && \
  python3.8 -m pip install --user six && \
  python3.8 -m pip install --user numpy


   # copy spark configs
  ADD spark-env.sh $SPARK_HOME/conf/
  ADD workers $SPARK_HOME/conf/
  
  # expose various ports
  EXPOSE 7012 7013 7014 7015 7016 8881 8081 7077

Also, I copy the required files and build the image with this script:

  #!/bin/bash

  cp /etc/ssh/ssh_config .
  cp /opt/spark/conf/spark-env.sh .
  cp /opt/spark/conf/workers .
  sudo docker build -t my_docker .
  echo "Script Finished."

The Dockerfile builds without any error. Then I make a tar file from the resulting image with this command:

  sudo docker save my_docker > my_docker.tar

After that I copy my_docker.tar to another computer and load it:

  sudo docker load < my_docker.tar
  sudo docker run -it my_docker

Unfortunately, when I run my program inside the Docker container, I receive errors about Python packages like numpy, pyspark, and pandas:

  File "/home/spark/ETL/test/main.py", line 3, in <module>
  import cst_utils as cu
  File "/home/spark/ETL/test/cst_utils.py", line 5, in <module>
  import group_state as gs
  File "/home/spark/ETL/test/group_state.py", line 1, in <module>
  import numpy as np
  ModuleNotFoundError: No module named 'numpy'

I also tried installing the Python packages inside the running container and then committing the container. But when I exit the container and enter it again, none of the Python packages are installed.
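
For reference, the commit workflow I describe above looks roughly like this (the container ID and the new tag are placeholders); committed packages would only persist in the new image tag, not in the original my_docker image:

  sudo docker ps                                     # find the running container's ID
  sudo docker commit <container_id> my_docker:pkgs   # snapshot its filesystem as a new image
  sudo docker run -it my_docker:pkgs                 # the committed packages exist only in this tag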

Would you please tell me what is wrong with my approach?

Any help is really appreciated.

    I don't see the `CMD` in the Dockerfile; what is the main container process? How much of the Dockerfile is actually necessary to reproduce the issue? – David Maze Feb 23 '22 at 17:41
  • By the way, `pip install` accepts multiple parameters at once... And why do you keep editing the resolv.conf? – OneCricketeer Feb 23 '22 at 17:44
  • @DavidMaze I assume OP is doing `docker run` / `exec`, then `python` or `spark-submit` – OneCricketeer Feb 23 '22 at 17:50
  • Dear @OneCricketeer, thank you for your feedback. I tested the Red Hat image: only if I set **nameserver 9.9.9.9** in **/etc/resolv.conf** can I install packages from the internet. Also, **/etc/resolv.conf** is created again for every **RUN** command, so I add **nameserver 9.9.9.9** to **/etc/resolv.conf** each time I want to install packages. And you are right, I want to use ```docker run / exec``` and then ```spark-submit``` to run the program. – M_Gh Feb 24 '22 at 07:53
  • The Docker container should use your host's own DNS resolver, so it sounds like you have a separate network issue. No RUN command that you're using should be changing that file either. – OneCricketeer Feb 24 '22 at 14:22
  • The canonical question for this problem on ***Windows*** may be *[Error "Import Error: No module named numpy" on Windows](https://stackoverflow.com/questions/7818811/)* (2011, 40 answers and 300 votes). – Peter Mortensen Aug 22 '22 at 15:11
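
A hedged sketch of the daemon-level alternative hinted at in the comments above: rather than appending a nameserver inside every RUN step, the Docker daemon on the build host can be told which DNS server to use (this assumes a systemd-based host and no existing /etc/docker/daemon.json):

  # on the build host, point the Docker daemon at the DNS server once
  echo '{ "dns": ["9.9.9.9"] }' | sudo tee /etc/docker/daemon.json   # overwrites any existing daemon.json
  sudo systemctl restart docker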

3 Answers


Aside from any issues with the Dockerfile setup itself,

In your spark-env.sh, set these to make sure Spark uses the same Python environment that pip installed into:

  export PYSPARK_PYTHON="/usr/bin/python3.8"
  export PYSPARK_DRIVER_PYTHON="/usr/bin/python3.8"

Keep in mind that Spark SQL DataFrames should really be used instead of numpy, and you don't need to `pip install pyspark` since it is already part of the downloaded Spark package.
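
A minimal sketch, assuming local mode and a writable /tmp, to confirm which interpreter the driver and the executors actually pick up:

  # write a tiny check script and submit it with Spark
  cat > /tmp/check_python.py << 'EOF'
  import sys
  from pyspark.sql import SparkSession
  spark = SparkSession.builder.getOrCreate()
  print("driver python:  ", sys.executable)
  executors = spark.sparkContext.parallelize([0], 1) \
      .map(lambda _: __import__("sys").executable).collect()
  print("executor python:", executors)
  spark.stop()
  EOF
  $SPARK_HOME/bin/spark-submit /tmp/check_python.py

Both lines should print /usr/bin/python3.8 once the spark-env.sh settings are picked up.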

– OneCricketeer

I played around with your code, removing most stuff that seemed (to me) irrelevant to the problem.

I found that moving

echo "alias python=/usr/bin/python3.8" >> ~/.bashrc

down, after `USER myuser`, solved it. Before that I got python not found, and python3 turned out not to have numpy either, whereas python3.8 did. So there was some confusion there; maybe in your full example something happens that obscures this even more.

But try moving that statement, because ~/.bashrc is NOT the same file once you change user.
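
Roughly, the ordering meant here, as a trimmed Dockerfile sketch (base image, package installs, and everything else omitted):

  RUN useradd -d /home/spark/myuser myuser
  USER myuser
  WORKDIR /home/spark/myuser
  # from here on, ~/.bashrc is /home/spark/myuser/.bashrc, not /root/.bashrc
  RUN echo "alias python=/usr/bin/python3.8" >> ~/.bashrc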

– Chai
  • I did what you suggested. In fact, when I run ```import numpy``` in the **Python environment**, I do not receive any error. But when I run the program with ```spark-submit```, I get **No module named 'numpy'**. – M_Gh Feb 24 '22 at 10:03

Problem solved. I changed the Dockerfile: first, I no longer define any user; second, I set PYSPARK_PYTHON, so there is no error about importing any packages. The Dockerfile now looks like this:

 FROM redhat/ubi8:latest
 RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf 
 RUN mkdir /home/spark &&  mkdir /home/ETL && mkdir /usr/lib/java && mkdir /usr/share/oracle
 # set environment vars
 ENV SPARK_HOME /home/spark
 ENV JAVA_HOME /usr/lib/java
 # install packages
 RUN \
 echo "nameserver 9.9.9.9" >> /etc/resolv.conf && \
 yum -y update  && \
 yum install -y libaio && \
 yum install -y libaio.so.1 && \
 dnf install -y libnsl* && \
 yum install -y rsync && yum install -y wget && yum install -y python3-pip && yum install -y openssh-server && yum install -y openssh-clients && \
 yum install -y unzip && yum install -y python38 && yum install -y nano 
 #WARNING: Running pip install with root privileges is generally not a good idea. Try `python3.8 -m pip install --user` instead.
 # It is just a warning
 RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf && \
 python3.8 -m pip install pandas && \ 
 python3.8 -m pip install cx-Oracle && \
 python3.8 -m pip install persiantools && \
 python3.8 -m pip install pyspark && \
 python3.8 -m pip install py4j && \
 python3.8 -m pip install python-dateutil && \
 python3.8 -m pip install pytz && \
 python3.8 -m pip install setuptools && \
 python3.8 -m pip install numpy && \
 python3.8 -m pip install six
 # create ssh keys
 RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf && \
 ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
 chmod 0600 ~/.ssh/authorized_keys

  # copy ssh config
 COPY ssh_config /root/.ssh/config
 COPY spark-3.1.2-bin-hadoop3.2.tgz /home
 COPY jdk-8u25-linux-x64.tar.gz  /home
 COPY instantclient-basic-linux.x64-21.4.0.0.0dbru.zip /home
 COPY etl /home/ETL/
 RUN \
 tar -zxvf /home/spark-3.1.2-bin-hadoop3.2.tgz -C /home && mv -v /home/spark-3.1.2-bin-hadoop3.2/* $SPARK_HOME && tar -zxvf /home/jdk-8u25-linux-x64.tar.gz -C /home && mv -v /home/jdk1.8.0_25/* $JAVA_HOME && unzip /home/instantclient-basic-linux.x64-21.4.0.0.0dbru.zip -d /home 
 RUN \  
 echo "export JAVA_HOME=$JAVA_HOME" >> ~/.bashrc && \
 echo "export SPARK_HOME=$SPARK_HOME" >> ~/.bashrc && \
 echo "export PATH=$PATH:$JAVA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:/home/instantclient_21_4" >> ~/.bashrc && echo "export LD_LIBRARY_PATH=/home/instantclient_21_4" >> ~/.bashrc && echo "alias python=/usr/bin/python3.8" >> ~/.bashrc && \
 echo "export PYTHONPATH=$SPARK_HOME/python:/usr/bin/python3.8" >> ~/.bashrc && echo "export PYSPARK_PYTHON=/usr/bin/python3.8" >> ~/.bashrc

 ENV LD_LIBRARY_PATH="/home/instantclient_21_4"
 ENV PYTHONPATH="$SPARK_HOME/python:/usr/bin/python3.8"
 ENV PATH="$PATH:$JAVA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:/home/instantclient_21_4"

 RUN \
 touch /etc/ld.so.conf.d/instantclient.conf && \
 echo "#path of instant client" >> /etc/ld.so.conf.d/instantclient.conf && \
 echo "/home/instantclient_21_4" >> /etc/ld.so.conf.d/instantclient.conf && \
 ldconfig

 # copy spark configs
 ADD spark-env.sh $SPARK_HOME/conf/
 ADD workers $SPARK_HOME/conf/

 # expose various ports
 EXPOSE 7012 7013 7014 7015 7016 8881 8081 7077

I hope this is useful to others.
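
For completeness, a rough sketch of building this image and launching the job with it (the image tag and the script path are assumptions based on the question):

  sudo docker build -t my_docker .
  sudo docker run -it my_docker
  # inside the container (path assumed from COPY etl /home/ETL/):
  $SPARK_HOME/bin/spark-submit /home/ETL/test/main.py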

– M_Gh