This will depend on the structure of your project and the type of dependencies you have.
If your project just has multiple .py files and no external dependencies, you can upload those files to S3 and pass them to the job using the spark.submit.pyFiles Spark property. One thing to be aware of here is that if your local project is structured with directories, you'll need to zip up those files and upload the zip instead.
First, if you just have a flat structure, upload both your main PySpark script and the additional Python files to S3.
.
├── job1.py
└── main.py
1 directory, 2 files
You can do this using aws s3 sync:
aws s3 sync . s3://<BUCKET_NAME>/prefix/
Then, when you create your EMR Serverless job in the console, you provide s3://<BUCKET_NAME>/prefix/main.py as the script location, and in the "Spark properties" section, you add spark.submit.pyFiles with a comma-separated list of your .py files.
spark.submit.pyFiles s3://<BUCKET_NAME>/prefix/job1.py
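With that in place, main.py can import job1 directly, because files passed via spark.submit.pyFiles are added to the Python path on the driver and executors. A minimal sketch (the run_job1 function name is just a hypothetical example, not something EMR Serverless requires):

# main.py -- hypothetical sketch of a flat-layout entry point
from pyspark.sql import SparkSession

import job1  # available on the Python path via spark.submit.pyFiles

if __name__ == "__main__":
    spark = SparkSession.builder.appName("example").getOrCreate()
    # run_job1 is a hypothetical function assumed to be defined in job1.py
    job1.run_job1(spark)
    spark.stop()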
If you have your modules in directories, you'll need to zip up those directories, upload the zip to S3, and provide the zip file as the path to spark.submit.pyFiles. It's very important to have an __init__.py in your directories for Python/PySpark to recognize them as packages.
.
├── jobs
│ ├── __init__.py
│ └── job1.py
└── main.py
2 directories, 3 files
If you have zip installed locally:
zip -r jobs.zip jobs
aws s3 cp main.py s3://<BUCKET_NAME>/prefix/
aws s3 cp jobs.zip s3://<BUCKET_NAME>/prefix/
spark.submit.pyFiles s3://<BUCKET_NAME>/prefix/jobs.zip
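On the cluster, the contents of jobs.zip are added to the Python path, so main.py can import from the jobs package as usual. A minimal sketch, assuming a hypothetical run function defined in jobs/job1.py:

# main.py -- hypothetical sketch of a packaged-layout entry point
from pyspark.sql import SparkSession

# jobs.zip is added to the Python path via spark.submit.pyFiles, so the
# jobs package (thanks to its __init__.py) is importable as usual.
from jobs import job1

if __name__ == "__main__":
    spark = SparkSession.builder.appName("example").getOrCreate()
    # run() is a hypothetical function assumed to exist in jobs/job1.py
    job1.run(spark)
    spark.stop()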
If you're using third-party dependencies, check out this other similar question: How to run a Python project (package) on AWS EMR serverless?
Finally, there's also a new emr-cli project under development that makes deploying and running a job on EMR Serverless as easy as one command. It will automatically detect the additional .py files, zip them up, upload them to S3, and provide the right parameters to EMR Serverless.
emr run \
--application-id <APPLICATION_ID> \
--job-role arn:aws:iam::<ACCOUNT_ID>:role/<JOB_ROLE> \
--s3-code-uri s3://<BUCKET_NAME>/<PREFIX> \
--entry-point main.py \
--build --wait