12

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).

I can submit jobs and they run successfully; however, they never seem to run on more than one machine (the local machine I submit from).

I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.

I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.

I have a very simple Python script processing data from HDFS, like so:

import simplejson as json
from pyspark import SparkContext

sc = SparkContext("", "Joe Counter")

# Read the raw JSON tweets from HDFS
rdd = sc.textFile("hdfs:///tmp/twitter/json/data/")

# Parse each line into a dict
data = rdd.map(lambda line: json.loads(line))

# Keep only tweets whose text mentions "Joe"
joes = data.filter(lambda tweet: "Joe" in tweet.get("text", ""))

print joes.count()

And I am running a submit command like:

spark-submit atest.py --deploy-mode client --master yarn-client

What can I do to ensure the job runs in parallel across the cluster?

aaa90210

4 Answers

9

Can you swap the arguments for the command?

spark-submit --deploy-mode client --master yarn-client atest.py

If you look at the help text for the command:

spark-submit

Usage: spark-submit [options] <app jar | python file>
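
Anything that comes after the application file is handed to the application itself rather than to spark-submit. So in a hypothetical call like the one below (--some-app-flag is made up purely for illustration), the trailing options would end up in the script's sys.argv instead of configuring Spark:

spark-submit --master yarn-client atest.py --some-app-flag value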
MrChristine
4

I believe @MrChristine is correct -- the option flags you specify are being passed to your Python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors, since by default the job runs on a single core and uses only two executors.
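
For example, a submit command along these lines should spread the work over several nodes (the executor count, cores, and memory below are only illustrative values to tune for your own cluster):

spark-submit --master yarn-client --num-executors 4 --executor-cores 2 --executor-memory 2g atest.py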

Rok
0

It's not true that a Python script doesn't run in cluster mode. I am not sure about previous versions, but this executes on Spark 2.2 on a Hortonworks cluster.

Command: spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py

Python code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Build the configuration and request YARN as the master
conf = (SparkConf()
        .setMaster("yarn")
        .setAppName("retrieve data"))
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)

# Read all Parquet files under the given HDFS path
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")

# Register a temporary view and query it with SQL
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()

# Append the result to HDFS as CSV
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()

Output: it's big, so I am not pasting it here, but it runs perfectly.

Avinav Mishra
-2

It seems that PySpark does not run in distributed mode using Spark on YARN; you need to use standalone Spark with a Spark master server. In that case, my PySpark script ran very well across the cluster, with a Python process per core on each node.

aaa90210