I am working with AWS Glue and PySpark ETL scripts, and I want to use auxiliary libraries such as google_cloud_bigquery as part of my PySpark scripts.
The documentation states this should be possible. This previous Stack Overflow discussion, especially one comment in one of the answers, seems to provide additional proof. However, how to actually do it is unclear to me.
So the goal is to turn the `pip install`ed packages into one or more zip files, so that I can simply host the packages on S3 and point to them like so:

`s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip`
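For reference, this is roughly how I intend to wire those zips into the Glue job, via the `--extra-py-files` special parameter (bucket names and job details are just placeholders, and I'm not certain this usage is correct):

```
# Rough sketch of the job definition I have in mind -- names are placeholders.
aws glue create-job \
  --name my-bigquery-etl \
  --role MyGlueServiceRole \
  --command '{"Name":"glueetl","ScriptLocation":"s3://bucket/prefix/script.py"}' \
  --default-arguments '{"--extra-py-files":"s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip"}'
```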
How that should be done is not clearly stated anywhere I've looked. I.e., how do I `pip install` a package and then turn it into a zip file that I can upload to S3, so that PySpark can use it via such an S3 URL?
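This is the kind of workflow I have in mind, but I haven't verified that it produces something Glue actually accepts (the package and bucket names are just examples; the PyPI name of the library is google-cloud-bigquery):

```
# My guess at the workflow: install into a local folder, zip the folder
# contents, upload the zip to S3. Untested with Glue.
pip install google-cloud-bigquery -t ./glue_libs
cd glue_libs
zip -r ../glue_libs.zip .
cd ..
aws s3 cp glue_libs.zip s3://bucket/prefix/glue_libs.zip
```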
Using the command `pip download` I have been able to fetch the libs, but by default they are not .zip files; instead they come as either .whl files or .tar.gz archives. So I'm not sure how to turn them into zip files that AWS Glue can digest. Maybe with the .tar.gz files I could just `tar -xf` them and then `zip` them back up, but what about the .whl files?
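One thought: as far as I know a .whl file is itself just a zip archive with the package at its top level, so I assume something like the following could convert it (the filename is a placeholder, and I'm only guessing that Glue wants the package at the root of the zip):

```
# Unpack the wheel (a wheel is a zip archive) and re-zip its contents so the
# package sits at the root of the resulting .zip -- just my assumption.
unzip some_package-1.0.0-py3-none-any.whl -d some_package_unpacked
cd some_package_unpacked
zip -r ../some_package.zip .
cd ..
```

Is that the right idea, or is there a more standard way to package these dependencies for Glue?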