
Hi, I am new to EMR Serverless and trying to learn. I have a PySpark project that I want to run using EMR Serverless. I tried using the console, but it does not let me provide a folder location as input. I can only submit a single file, and when I try that it complains about missing dependencies (the other .py files). Can someone please share the steps? TYIA

Tried to run with just the main file; however, it failed complaining about missing dependencies.

Corey A
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Mar 22 '23 at 16:45

1 Answer


This will depend on the structure of your project and the type of dependencies you have.

If your project just has multiple .py files and no external dependencies, you can upload those files to S3 and pass them to the job using the spark.submit.pyFiles Spark property. One thing to be aware of here is that if your local project is structured with directories, you'll need to zip up those files and upload the zip instead.

First, if you just have a flat structure, upload both your main PySpark script and the additional Python files to S3.

.
├── job1.py
└── main.py

1 directory, 2 files

You can do this using aws s3 sync:

aws s3 sync . s3://<BUCKET_NAME>/prefix/

Then, when you create your EMR Serverless job in the console, you provide s3://<BUCKET_NAME>/prefix/main.py as the script location, and in the "Spark properties" section, you add spark.submit.pyFiles with a comma-separated list of your .py files.

spark.submit.pyFiles       s3://<BUCKET_NAME>/prefix/job1.py
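If you'd rather submit from the command line than from the console, the same configuration can be passed to EMR Serverless through the AWS CLI. This is a minimal sketch assuming the flat layout above; <APPLICATION_ID>, <ACCOUNT_ID>, and <JOB_ROLE> are placeholders for your own application, account, and job execution role:

aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn arn:aws:iam::<ACCOUNT_ID>:role/<JOB_ROLE> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<BUCKET_NAME>/prefix/main.py",
            "sparkSubmitParameters": "--conf spark.submit.pyFiles=s3://<BUCKET_NAME>/prefix/job1.py"
        }
    }'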

If you have your modules in directories, you'll need to zip up those directories, upload the zip to S3, and provide the zip file as the path for spark.submit.pyFiles. It's very important to have an __init__.py file in each directory so that Python/PySpark recognizes them as packages.

.
├── jobs
│   ├── __init__.py
│   └── job1.py
└── main.py

2 directories, 3 files

If you have zip installed locally:

zip -r jobs.zip jobs
aws s3 cp main.py s3://<BUCKET_NAME>/prefix/
aws s3 cp jobs.zip s3://<BUCKET_NAME>/prefix/

Then point the Spark property at the zip file:

spark.submit.pyFiles       s3://<BUCKET_NAME>/prefix/jobs.zip
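Before uploading, it's worth double-checking that the archive actually contains the package directory with its __init__.py, since a missing __init__.py is a common cause of import errors at runtime:

unzip -l jobs.zip
# should list jobs/__init__.py and jobs/job1.py inside the archive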

If you're using third-party dependencies, check out this other similar question: How to run a Python project (package) on AWS EMR serverless?

Finally, there's also a new emr-cli project under development that makes deploying and running a job on EMR Serverless as easy as one command. It will automatically detect the additional .py files, zip them up, upload them to S3 and provide the right parameters to EMR Serverless.

emr run \
        --application-id <APPLICATION_ID> \
        --job-role arn:aws:iam::<ACCOUNT_ID>:role/<JOB_ROLE> \
        --s3-code-uri s3://<BUCKET_NAME>/<PREFIX> \
        --entry-point main.py \
        --build --wait
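
The emr-cli tool is a Python package; assuming it's published on PyPI under the name emr-cli, you can install it with pip before running the command above:

pip install emr-cli
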
dacort