I am trying to use pip to install libraries into a Python virtualenv, which resides on an AWS EMR master node. For some reason, sudo pip works fine, but non-sudo pip silently fails.

Some background:

  • I am launching an EMR cluster with version emr-5.19.0.
  • I am SSHing into the master node, which uses Amazon Linux AMI 2018.03.
  • By default, this OS has both Python 2.7 and 3.4 installed.
  • I created a new virtualenv, based on the already-installed Python 3.4.
  • I activated my new virtualenv and verified that all paths point to my venv installation (not to the global Python installation), e.g. the output of which python and which pip both look correct.
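
The check in the last bullet can also be made from inside the interpreter, which rules out a shell alias or PATH quirk. The following is a minimal sketch, not part of the original setup: the helper name in_virtualenv is hypothetical, and it assumes the environment was created either by the standalone virtualenv tool (older releases set sys.real_prefix) or by python -m venv (which makes sys.prefix differ from sys.base_prefix).

```python
import sys


def in_virtualenv():
    """Return True if this interpreter appears to run inside a virtual environment."""
    # Older releases of the standalone virtualenv tool set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # python -m venv (and newer virtualenv releases) instead make sys.prefix
    # differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("prefix:     ", sys.prefix)
    print("inside venv:", in_virtualenv())
```

Run it with the venv's own binary (path/to/venv/bin/python) so the result describes that interpreter rather than the system one.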

So, I create and activate my virtualenv as follows:

cd /home/ec2-user/my_app
virtualenv --python=python3.4 venv
source venv/bin/activate

This works. Next, I try to install a sample library as follows:

pip install numpy

The output is:

Collecting numpy
Installing collected packages: numpy
Successfully installed numpy-1.16.0

However, despite the output claiming success, import numpy produces an import error, and numpy doesn't show up in pip list or pip freeze. I have even drilled into path/to/venv/lib/python3.4/dist-packages and verified no numpy directory gets created.
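
When pip reports success but the package cannot be imported, one way to narrow things down is to search the whole venv for the package directory and check whether its parent is on sys.path. The sketch below is only an illustration: find_package_dirs and is_importable_location are hypothetical helper names, and it assumes the package is a regular directory containing an __init__.py.

```python
import os
import sys


def find_package_dirs(root, package):
    """Return every importable-looking directory named `package` below `root`."""
    matches = []
    for dirpath, dirnames, _filenames in os.walk(root):
        candidate = os.path.join(dirpath, package)
        # Require __init__.py so that header or data directories that merely
        # share the package's name are skipped.
        if package in dirnames and os.path.isfile(os.path.join(candidate, "__init__.py")):
            matches.append(candidate)
    return sorted(matches)


def is_importable_location(package_dir):
    """Return True if the parent of `package_dir` is one of the sys.path entries."""
    parent = os.path.realpath(os.path.dirname(package_dir))
    return parent in {os.path.realpath(p) for p in sys.path if p}


if __name__ == "__main__":
    # sys.prefix is the venv root when this runs under the venv's interpreter.
    for found in find_package_dirs(sys.prefix, "numpy"):
        print(found, "on sys.path:", is_importable_location(found))
```

If the package shows up with "on sys.path: False", pip did write the files, just not somewhere this interpreter looks.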

Sadly, this does work:

sudo path/to/venv/bin/pip install numpy

The problem is: I don't want to use sudo, because that would go against best practices. However, it seems like most people are using sudo for this task (examples here and here), so perhaps this is just a requirement in an EMR environment?

Note: This issue only happens for some libraries. For instance, pyspark and geocoder install fine, but numpy and pandas silently fail.

Chris Cugliotta
  • You need to show how you are running the code that imports numpy. Most probably, you are not doing so inside the virtualenv where you installed it. – Daniel Roseman Jan 22 '19 at 17:16
  • It may depend on the numpy libs being linked from lower system levels. – eagle33322 Jan 22 '19 at 17:16
  • python -m venv may be an updated version vs standalone virtualenv – eagle33322 Jan 22 '19 at 17:17
  • @DanielRoseman, I am confident my code was running 'inside' my venv, because I launched an interactive Python shell by calling the venv binary directly. From there, I simply tried `import numpy`, `import pyspark`, etc. FYI, I ended up solving this issue. Please see my answer posted below. – Chris Cugliotta Jan 22 '19 at 18:08

1 Answer

I ended up figuring this out: pip was (sometimes, but not always) placing modules in a particular directory that wasn't on the Python path! This appears to be a known bug between Amazon Linux and pip.

For instance, numpy was getting placed at:

path/to/venv/lib64/python3.4/dist-packages/numpy

However, pyspark was getting placed at:

path/to/venv/lib/python3.4/dist-packages/pyspark

The latter directory is on the Python path, but the former is not. This is why import pyspark worked, but import numpy did not. We can force pip to install libraries into the appropriate directory as follows:

pip install numpy --target='/path/to/venv/lib/python3.4/dist-packages'

The command above solves my issue.
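
To see whether a given machine is affected, it may help to compare the interpreter's configured install directories against sys.path. This is a rough sketch under stated assumptions: install_dirs_off_path is a hypothetical helper name, and sysconfig's purelib/platlib values may not match exactly what an older, distro-patched pip uses, so treat the output as a hint rather than proof.

```python
import os
import sys
import sysconfig


def install_dirs_off_path():
    """Return {scheme key: directory} for install dirs that are not on sys.path."""
    on_path = {os.path.realpath(p) for p in sys.path if p}
    paths = sysconfig.get_paths()
    missing = {}
    # purelib receives pure-Python packages; platlib receives packages
    # that ship compiled extensions.
    for key in ("purelib", "platlib"):
        if os.path.realpath(paths[key]) not in on_path:
            missing[key] = paths[key]
    return missing


if __name__ == "__main__":
    missing = install_dirs_off_path()
    if not missing:
        print("purelib and platlib are both on sys.path")
    for key, directory in sorted(missing.items()):
        print("{} is not on sys.path: {}".format(key, directory))
```

A purelib/platlib split would be consistent with only some libraries failing, since numpy and pandas ship compiled extensions while pyspark and geocoder are pure Python. That connection is an assumption, not something confirmed above.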

Chris Cugliotta