
I used Livy to execute a script stored in S3 via a POST request launched from EMR. The script runs, but it times out very quickly. I have tried editing the livy.conf configuration, but none of the changes seem to stick. This is the error that is returned:

java.lang.Exception: No YARN application is found with tag livy-batch-10-hg3po7kp in 120 seconds. Please check your cluster status, it is may be very busy.
org.apache.livy.utils.SparkYarnApp.org$apache$livy$utils$SparkYarnApp$$getAppIdFromTag(SparkYarnApp.scala:182)
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:239)
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:236)
scala.Option.getOrElse(Option.scala:121)
org.apache.livy.utils.SparkYarnApp$$anonfun$1.apply$mcV$sp(SparkYarnApp.scala:236)
org.apache.livy.Utils$$anon$1.run(Utils.scala:94)
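
For reference, the kind of livy.conf change in question looks like this (the key name comes from livy.conf.template, so treat it as an assumption; Livy has to be restarted for the file to be re-read):

# Raise the YARN app lookup timeout in livy.conf on the EMR master node
# (assumption: livy.server.yarn.app-lookup-timeout is the setting behind the
# "in 120 seconds" message; the config path is the EMR default).
sudo sh -c 'echo "livy.server.yarn.app-lookup-timeout = 600s" >> /etc/livy/conf/livy.conf'

# Restart Livy so the configuration is re-read (the exact command depends on the EMR release).
sudo systemctl restart livy-server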

2 Answers


This was a tricky problem to solve, but I was able to get it working with this command:

curl -X POST --data '{"proxyUser": "hadoop","file": "s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/hello.py", "jars": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/NQjc.jar"], "pyFiles": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/application.zip"], "archives": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/venv.zip#venv"], "driverMemory": "10g", "executorMemory": "10g", "name": "Name of Import Job here", "conf":{
"spark.yarn.appMasterEnv.SPARK_HOME": "/usr/lib/spark",
"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
"livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
"spark.yarn.executorEnv.PYSPARK_PYTHON": "./venv/bin/python",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.requirements":"requirements.pip",
"spark.pyspark.virtualenv.bin.path": "virtualenv",
"spark.master": "yarn",
"spark.submit.deployMode": "cluster"}}' -H "Content-Type: application/json" http://MY-PATH--TO-MY--EMRCLUSTER:8998/batches

I ran that POST after first running this script on the EMR cluster's master node to set up my dependencies (I had already cloned the repository containing the application files):

#!/bin/bash
set -e
set -x

export HADOOP_CONF_DIR="/etc/hadoop/conf"
export PYTHON="/usr/bin/python3"
export SPARK_HOME="/usr/lib/spark"
export PATH="$SPARK_HOME/bin:$PATH"


# Set $PYTHON to the Python executable you want to create
# your virtual environment with. It could just be something
# like `python3`, if that's already on your $PATH, or it could
# be a /fully/qualified/path/to/python.
test -n "$PYTHON"

# Make sure $SPARK_HOME is on your $PATH so that `spark-submit`
# runs from the correct location.
test -n "$SPARK_HOME"

"$PYTHON" -m venv venv --copies
source venv/bin/activate
pip install -U pip
pip install -r requirements.pip
deactivate

# Here we package up an isolated environment that we'll ship to YARN.
# The awkward zip invocation for venv just creates nicer relative
# paths.
pushd venv/
zip -rq ../venv.zip *
popd

# Here it's important that application/ be zipped in this way so that
# Python knows how to load the module inside.
zip -rq application.zip application/

as per the instructions I provided here: Bundling Python3 packages for PySpark results in missing imports

If you run into any issues, it is helpful to check the Livy logs here:

/var/log/livy/livy-livy-server.out
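
For example, tailing that log while re-submitting, or grepping it for the YARN tag from the timeout error, usually shows what spark-submit actually attempted (the tag below is just the one from the question):

# Follow the Livy server log while re-submitting the batch.
tail -f /var/log/livy/livy-livy-server.out

# Or search it for the YARN tag mentioned in the timeout error.
grep "livy-batch-10-hg3po7kp" /var/log/livy/livy-livy-server.out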

as well as the logs that show up in the Hadoop Resource Manager UI, which you can access from the link in the EMR console once you have tunneled into your EMR master node and set up your web browser proxy.
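
A sketch of that tunnel step, assuming the usual EMR dynamic port forwarding setup with placeholder key and hostname values:

# Open a SOCKS proxy to the EMR master node (replace the key and hostname), then
# point your browser proxy (e.g. FoxyProxy) at localhost:8157 and open the
# Resource Manager link from the EMR console.
ssh -i ~/mykey.pem -N -D 8157 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com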

One key aspect of this solution is that Livy is unable to upload files from the local master node when they are provided via the file, jars, pyFiles, or archives parameters, due to the issue mentioned here: https://issues.apache.org/jira/browse/LIVY-222

So, I was able to work around that issue by referencing files that I uploaded to S3 by leveraging EMRFS. Also, with virtualenv (if you're using PySpark), it's really important to use the --copies parameter or you will end up with symlinks that can't be used from HDFS.
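
A sketch of that step, reusing the placeholder bucket and prefix from the curl command above, plus a quick check that --copies really left no symlinks behind:

# Verify the virtualenv contains real copies rather than symlinks (should print nothing).
find venv/ -type l

# Upload the bundles so Livy can reference them through EMRFS.
aws s3 cp venv.zip s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/venv.zip
aws s3 cp application.zip s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/application.zip
aws s3 cp hello.py s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/hello.py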

There are also issues with using virtualenv in PySpark (which may not apply to you) that have been reported here: https://issues.apache.org/jira/browse/SPARK-13587 and I needed to work around them by adding additional parameters. Some of them are also mentioned here: https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html

Regardless, until I worked around the local-file upload issue by referencing the files from S3 via EMRFS, Livy would fail because it could not upload the files to the staging directory. Also, when I tried providing absolute HDFS paths instead of using S3, the resources were owned by the hadoop user rather than the livy user, so Livy could not access them and copy them into the staging directory for the job execution. Referencing the files from S3 via EMRFS was therefore necessary.
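
If you do want to keep the resources in HDFS instead, a sketch of diagnosing and fixing that ownership problem (the path and group here are assumptions):

# See who owns the uploaded resources (in my case they belonged to hadoop, not livy).
hdfs dfs -ls /user/hadoop/sample-pyspark-app/

# One possible fix: hand them to the livy user so it can copy them into its staging directory.
sudo -u hdfs hdfs dfs -chown -R livy:livy /user/hadoop/sample-pyspark-app/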

devinbost

The solution was to check the code in SparkUtil.scala.

The getOrCreate configuration should be active; if it is not, Livy cannot check and close the YARN connection.

For example:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName(appName).getOrCreate()

Also, in my case, some lines had been commented out, and that was the problem.


karel