I am working with AWS Glue and PySpark ETL scripts, and I want to use auxiliary libraries such as google_cloud_bigquery as part of my PySpark scripts.
The documentation states this should be possible. This previous Stack Overflow discussion, especially one comment in one of the answers, seems to provide additional proof. However, how to actually do it is unclear to me.
So the goal is to turn the `pip install`ed packages into one or more zip files, so that I can simply host the packages on S3 and point to them like so:

`s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip`
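For reference, this is roughly how I intend to wire those zips into the Glue job, via the `--extra-py-files` special parameter (bucket names and job details are just placeholders, and I'm not certain this usage is correct):

```
# Rough sketch of the job definition I have in mind -- names are placeholders.
aws glue create-job \
  --name my-bigquery-etl \
  --role MyGlueServiceRole \
  --command '{"Name":"glueetl","ScriptLocation":"s3://bucket/prefix/script.py"}' \
  --default-arguments '{"--extra-py-files":"s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip"}'
```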
How that should be done is not clearly stated anywhere I've looked. I.e., how do I `pip install` a package and then turn it into a zip file that I can upload to S3, so that PySpark can use it via such an S3 URL?
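This is the kind of workflow I have in mind, but I haven't verified that it produces something Glue actually accepts (the package and bucket names are just examples; the PyPI name of the library is google-cloud-bigquery):

```
# My guess at the workflow: install into a local folder, zip the folder
# contents, upload the zip to S3. Untested with Glue.
pip install google-cloud-bigquery -t ./glue_libs
cd glue_libs
zip -r ../glue_libs.zip .
cd ..
aws s3 cp glue_libs.zip s3://bucket/prefix/glue_libs.zip
```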
Using the command `pip download` I have been able to fetch the libs, but by default they are not .zip files; instead they come as either .whl files or .tar.gz archives. So I'm not sure how to turn them into zip files that AWS Glue can digest. Maybe with the .tar.gz files I could just `tar -xf` them and then `zip` them back up, but what about the .whl files?
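One thought: as far as I know a .whl file is itself just a zip archive with the package at its top level, so I assume something like the following could convert it (the filename is a placeholder, and I'm only guessing that Glue wants the package at the root of the zip):

```
# Unpack the wheel (a wheel is a zip archive) and re-zip its contents so the
# package sits at the root of the resulting .zip -- just my assumption.
unzip some_package-1.0.0-py3-none-any.whl -d some_package_unpacked
cd some_package_unpacked
zip -r ../some_package.zip .
cd ..
```

Is that the right idea, or is there a more standard way to package these dependencies for Glue?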