My setup:
  • standard Dataproc image 2.0
  • Ubuntu 18.04 LTS
  • Hadoop 3.2
  • Spark 3.1

I am trying to run a very simple script on a Dataproc PySpark cluster:

testing_dep.py

import os

os.listdir('./')  # list the working directory contents (the result is never printed)

I can run testing_dep.py in client mode (the default on Dataproc) just fine:

gcloud dataproc jobs submit pyspark ./testing_dep.py --cluster=pyspark-monsoon --region=us-central1

But when I try to run the same job in cluster mode, I get an error:

gcloud dataproc jobs submit pyspark ./testing_dep.py --cluster=pyspark-monsoon --region=us-central1 --properties=spark.submit.deployMode=cluster

Error logs:

Job [417443357bcd43f99ee3dc60f4e3bfea] submitted.
Waiting for job output...
22/01/12 05:32:20 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at monsoon-testing-m/10.128.15.236:8032
22/01/12 05:32:20 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at monsoon-testing-m/10.128.15.236:10200
22/01/12 05:32:22 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
22/01/12 05:32:22 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
22/01/12 05:32:24 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1641965080466_0001
22/01/12 05:32:42 ERROR org.apache.spark.deploy.yarn.Client: Application diagnostics message: Application application_1641965080466_0001 failed 2 times due to AM Container for appattempt_1641965080466_0001_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: [2022-01-12 05:32:42.154]Exception from container-launch.
Container id: container_1641965080466_0001_02_000001
Exit code: 13

[2022-01-12 05:32:42.203]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
22/01/12 05:32:40 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception: 
java.lang.IllegalStateException: User did not initialize spark context!
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:520)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:268)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:899)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:898)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:898)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)


[2022-01-12 05:32:42.203]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
22/01/12 05:32:40 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception: 
java.lang.IllegalStateException: User did not initialize spark context!
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:520)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:268)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:899)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:898)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:898)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)


For more detailed output, check the application tracking page: http://monsoon-testing-m:8188/applicationhistory/app/application_1641965080466_0001 Then click on links to logs of each attempt.
. Failing the application.
Exception in thread "main" org.apache.spark.SparkException: Application application_1641965080466_0001 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1242)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1634)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [417443357bcd43f99ee3dc60f4e3bfea] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/417443357bcd43f99ee3dc60f4e3bfea?project=monsoon-credittech&region=us-central1
gcloud dataproc jobs wait '417443357bcd43f99ee3dc60f4e3bfea' --region 'us-central1' --project 'monsoon-credittech'
https://console.cloud.google.com/storage/browser/monsoon-credittech.appspot.com/google-cloud-dataproc-metainfo/64632294-3e9b-4c55-af8a-075fc7d6f412/jobs/417443357bcd43f99ee3dc60f4e3bfea/
gs://monsoon-credittech.appspot.com/google-cloud-dataproc-metainfo/64632294-3e9b-4c55-af8a-075fc7d6f412/jobs/417443357bcd43f99ee3dc60f4e3bfea/driveroutput



Can you please help me understand what I am doing wrong and why this code is failing?

  • Could it be that abc.zip must be in a location that is reachable from all cluster nodes? You provide only a relative location, which is valid only on your local machine. – Til Piffl Jan 11 '22 at 14:34
  • But my code does not depend on abc.zip. It just prints the contents of the working directory. – figs_and_nuts Jan 11 '22 at 14:36
  • I assume that does not matter since the archives are distributed before the python code is executed. Does the code work without the archives flag? – Til Piffl Jan 11 '22 at 14:41
  • No. But, I restarted my cluster and the error has changed. Beats me what is happening. I am updating the question to show the error. The code is failing with or without --archives in cluster mode – figs_and_nuts Jan 11 '22 at 17:20
  • Can you please take a look? – figs_and_nuts Jan 11 '22 at 17:22
  • I have looked into this more, and it has nothing to do with ```--archives``` or changing the Spark configuration. I created a simple 3-node vanilla cluster with the standard Dataproc image, and ```os.listdir()``` fails in cluster mode. I am updating the question accordingly so there isn't much to follow for anyone landing here. – figs_and_nuts Jan 12 '22 at 05:34
  • My next guess :) - your code might actually have been executed, but you don't see the output because (1) it has no print statement and (2) in cluster mode you do not see the stdout locally, since your driver code runs not on your local machine but on some other cluster node. The error message complains about not starting a Spark context, but that might be a standard error whenever no SparkSession.builder().getOrCreate() statement is present. – Til Piffl Jan 12 '22 at 07:24
  • If that is the case, shouldn't I get that error in client mode as well? – figs_and_nuts Jan 12 '22 at 08:13

1 Answer


The error is expected when running Spark in YARN cluster mode and the job itself doesn't create a Spark context. See the source code of ApplicationMaster.scala: the ApplicationMaster waits for the user code to initialize a SparkContext and fails with "User did not initialize spark context!" (the exception in the logs above) if that never happens.

To avoid this error, you need to create a SparkContext or SparkSession, e.g.:

from pyspark.sql import SparkSession

# Creating the SparkSession initializes the SparkContext
# that the YARN ApplicationMaster waits for in cluster mode.
spark = SparkSession.builder \
                    .appName('MySparkApp') \
                    .getOrCreate()

Client mode doesn't go through the same code path and doesn't have a similar check.
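
Applied to the script from the question, a minimal fixed version might look like this (the print() is an addition, not in the original script, so the directory listing actually shows up in the driver output):

import os
from pyspark.sql import SparkSession

# Create the session so the cluster-mode check passes.
spark = SparkSession.builder \
                    .appName('MySparkApp') \
                    .getOrCreate()

# print() so the listing appears in the driver output.
print(os.listdir('./'))

spark.stop()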

  • Yes, this solved the issue. It is weird that I do not get the error even when I create the ```SparkSession``` after the ```os.listdir()``` part of the code. I placed my ```SparkSession.builder.getOrCreate()``` at the end of the script and it still worked. – figs_and_nuts Jan 22 '22 at 20:41
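
That last observation is consistent with the check described above: presumably the ApplicationMaster only verifies that a SparkContext was initialized at some point before the user code finishes, so even this ordering (a minimal sketch, not the asker's exact script) passes:

import os
from pyspark.sql import SparkSession

# Runs before any Spark context exists.
print(os.listdir('./'))

# Created as the very last statement; the cluster-mode check still
# passes because the context exists by the time the user code finishes.
spark = SparkSession.builder.getOrCreate()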