12

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).

I can submit jobs and they run successfully; however, they never seem to run on more than one machine (the local machine I submit from).

I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.

I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.

I have a very simple Python script processing data from HDFS, like so:

import simplejson as json
from pyspark import SparkContext

sc = SparkContext("", "Joe Counter")

# Read the raw JSON tweets from HDFS
rdd = sc.textFile("hdfs:///tmp/twitter/json/data/")

# Parse each line into a dict
data = rdd.map(lambda line: json.loads(line))

# Keep only tweets whose text mentions "Joe"
joes = data.filter(lambda tweet: "Joe" in tweet.get("text", ""))

print joes.count()

And I am running a submit command like:

spark-submit atest.py --deploy-mode client --master yarn-client

What can I do to ensure the job runs in parallel across the cluster?

aaa90210

4 Answers

9

Can you swap the arguments for the command?

spark-submit --deploy-mode client --master yarn-client atest.py

If you look at the help text for the command:

spark-submit

Usage: spark-submit [options] <app jar | python file>
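
Anything that comes after the application file is handed to the application itself rather than to spark-submit. So in a hypothetical call like the one below (--some-app-flag is made up purely for illustration), the trailing options would end up in the script's sys.argv instead of configuring Spark:

spark-submit --master yarn-client atest.py --some-app-flag value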
MrChristine
4

I believe @MrChristine is correct -- the option flags you specify are being passed to your Python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors, since by default the job runs on a single core and uses only two executors.
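
For example, a submit command along these lines should spread the work over several nodes (the executor count, cores, and memory below are only illustrative values to tune for your own cluster):

spark-submit --master yarn-client --num-executors 4 --executor-cores 2 --executor-memory 2g atest.py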

Rok
0

It's not true that a Python script doesn't run in cluster mode. I am not sure about previous versions, but this executes on Spark 2.2 on a Hortonworks cluster.

Command: spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py

Python code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Build the configuration and request YARN as the master
conf = (SparkConf()
        .setMaster("yarn")
        .setAppName("retrieve data"))
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)

# Read all Parquet files under the given HDFS path
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")

# Register a temporary view and query it with SQL
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()

# Append the result to HDFS as CSV
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()

Output: it's big, so I am not pasting it here, but it runs perfectly.

Avinav Mishra
-2

It seems that PySpark does not run in distributed mode using Spark on YARN; you need to use standalone Spark with a Spark master server. In that case, my PySpark script ran very well across the cluster, with a Python process per core on each node.

aaa90210