This was a tricky problem to solve, but I was able to get it working with this command:
curl -X POST --data '{"proxyUser": "hadoop","file": "s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/hello.py", "jars": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/NQjc.jar"], "pyFiles": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/application.zip"], "archives": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/venv.zip#venv"], "driverMemory": "10g", "executorMemory": "10g", "name": "Name of Import Job here", "conf":{
"spark.yarn.appMasterEnv.SPARK_HOME": "/usr/lib/spark",
"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
"livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
"spark.yarn.executorEnv.PYSPARK_PYTHON": "./venv/bin/python",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.requirements":"requirements.pip",
"spark.pyspark.virtualenv.bin.path": "virtualenv",
"spark.master": "yarn",
"spark.submit.deployMode": "cluster"}}' -H "Content-Type: application/json" http://MY-PATH--TO-MY--EMRCLUSTER:8998/batches
Before submitting that request, I ran this script on the EMR cluster's master node to set up my dependencies, after cloning the repository containing the application files:
set -e
set -x
export HADOOP_CONF_DIR="/etc/hadoop/conf"

# Set $PYTHON to the Python executable you want to create
# your virtual environment with. It could just be something
# like `python3`, if that's already on your $PATH, or it could
# be a /fully/qualified/path/to/python.
export PYTHON="/usr/bin/python3"
test -n "$PYTHON"

# Make sure $SPARK_HOME is on your $PATH so that `spark-submit`
# runs from the correct location.
export SPARK_HOME="/usr/lib/spark"
export PATH="$SPARK_HOME/bin:$PATH"
test -n "$SPARK_HOME"
"$PYTHON" -m venv venv --copies
source venv/bin/activate
pip install -U pip
pip install -r requirements.pip
deactivate
# Here we package up an isolated environment that we'll ship to YARN.
# The awkward zip invocation for venv just creates nicer relative
# paths.
pushd venv/
zip -rq ../venv.zip *
popd
# Here it's important that application/ be zipped in this way so that
# Python knows how to load the module inside.
zip -rq application.zip application/
That script follows the instructions I provided here: Bundling Python3 packages for PySpark results in missing imports
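One step that's easy to miss: since the batch request above points at S3, the venv.zip and application.zip produced by the script (along with hello.py and the jar, if they aren't already in the bucket) need to be copied up to the same prefix the request references. A rough sketch of that upload, assuming the AWS CLI is available on the master node:
aws s3 cp venv.zip s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/venv.zip
aws s3 cp application.zip s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/application.zip
aws s3 cp hello.py s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/hello.py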
If you run into any issues, it is helpful to check the Livy logs here:
/var/log/livy/livy-livy-server.out
as well as the logs that show up in the Hadoop Resource Manager UI, which you can reach from the link in the EMR console once you have tunneled into your EMR master node and set up your web browser proxy.
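If you'd rather stay on the command line, something along these lines also works for digging into a failed batch; the application id below is just a placeholder for whatever YARN assigned to your job (you can find it in the Livy log or the Resource Manager UI), you may need sudo to read the Livy log, and yarn logs only shows everything once the application has finished or if log aggregation is enabled:
tail -f /var/log/livy/livy-livy-server.out
yarn logs -applicationId application_1234567890123_0001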
One key aspect of this solution is that Livy is unable to upload files from the local master node when they are provided via the file, jars, pyFiles, or archives parameters, due to the issue described here: https://issues.apache.org/jira/browse/LIVY-222
So, I worked around that issue by referencing files that I had uploaded to S3, leveraging EMRFS. Also, with virtualenv (if you're using PySpark), it's really important to use the --copies parameter, or you will end up with symlinks that can't be used from HDFS.
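A quick sanity check after building the environment (and before zipping it) is to look for symlinks under venv/bin; with --copies there shouldn't be any, whereas without it the python binaries show up as symlinks pointing back into the system install:
find venv/bin -type l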
There are also issues with using virtualenv together with PySpark that have been reported here: https://issues.apache.org/jira/browse/SPARK-13587 (they may not apply to you), and I needed to work around them by adding the additional conf parameters shown above. Some of them are also mentioned here: https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html
Regardless, until I worked around Livy's problem with local files by referencing them from S3 via EMRFS, the jobs would fail because Livy could not upload the files to the staging directory. I also tried providing absolute paths in HDFS instead of using S3, but because those HDFS resources were owned by the hadoop user rather than the livy user, Livy was unable to access them and copy them into the staging directory for the job execution. So referencing the files from S3 via EMRFS was necessary.
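If you want to see that ownership problem for yourself before giving up on the HDFS route, listing the staged files shows them owned by hadoop rather than livy (the path below is just an example; substitute wherever you staged the files):
hdfs dfs -ls /user/hadoop/sample-pyspark-app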