
I have run the lines below in the pyspark shell (mac, 8 cores).

import pandas as pd
df = spark.createDataFrame(pd.DataFrame(dict(a=list(range(1000)))))
df.show()

I want to count my worker nodes (and see the number of cores on each), so I ran the Python commands from this post:

sc._jsc.sc().getExecutorMemoryStatus().keys()
# JavaObject id=o151

len([executor.host() for executor in sc._jsc.sc().statusTracker().getExecutorInfos()]) - 1
# 0
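
For reference, here is a quick sanity check of what mode the shell is actually running in (a minimal sketch under the same assumptions, i.e. run inside the same pyspark shell where sc already exists):

# Minimal sketch: inspect the master URL and parallelism of the running shell.
print(sc.master)               # e.g. 'local[*]' when no cluster manager is attached
print(sc.defaultParallelism)   # in local[*] mode, typically the number of local cores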

The above code indicates I have 1 worker. So I checked the Spark UI, and I only have the driver with 8 cores: [Spark UI screenshot]

Can work be done by the cores in the driver? If so, are 7 cores doing work and 1 is reserved for "driver" functionality? Why aren't worker nodes being created automatically?

Michael Berk

1 Answer


It's not up to Spark to figure out the perfect cluster layout for the hardware you provide (and what counts as the perfect infrastructure is highly task-specific anyway).

Actually, the behaviour you describe is Spark's default when running against a YARN master (see the spark.executor.cores option in the docs).

To modify it, you either add some options when launching the pyspark shell or do it inside your code, for example:

from pyspark.sql import SparkSession

# Override executor/driver resources on the existing conf, then rebuild the session.
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory', '4g')])
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()
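
Equivalently, the same keys can be set when building the session (or passed with --conf when launching the pyspark shell). A minimal sketch, with the values above chosen only for illustration:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")  # memory per executor
         .config("spark.executor.cores", "4")    # cores per executor
         .config("spark.cores.max", "4")         # cap on total cores for this application
         .config("spark.driver.memory", "4g")    # driver memory
         .getOrCreate())

Note that getOrCreate() returns an already-running session if one exists, which is why the snippet above stops the current context before rebuilding it.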

More on that can be found here and here.

mckraqs