
I have set up a Spark standalone cluster with 4 workers (each with 4 cores) and 1 master, all running Windows 10. I submitted Spark's ML example multilayer_perceptron_classification.py to the standalone cluster, but it executes all tasks on a single executor on one worker.

The multilayer_perceptron_classification.py code (it uses Spark MLlib) is:

from pyspark.sql import SparkSession
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession\
    .builder.appName("multilayer_perceptron_classification_example").getOrCreate()

# Load the data stored in LIBSVM format as a DataFrame
data = spark.read.format("libsvm")\
    .load("C:/spark/spark-2.3.2-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data1.txt")

# Split the data into train and test sets
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# Network layers: input features (4), three hidden layers of 500, output classes (3)
layers = [4, 500, 500, 500, 3]

trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

model = trainer.fit(train)

# Compute accuracy on the test set
result = model.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))

spark.stop()

I don't know why it is running on only one machine. I want to know whether the training algorithm is inherently serial, or whether I missed some configuration of the Spark cluster. (I thought a Spark cluster would do distributed training, but it is not.) Please help me. Thank you in advance.

1 Answer


Check the number of partitions (in PySpark, data.rdd.getNumPartitions(); in Scala, data.rdd.partitions.size); most likely it is 1. The unit of parallelism in Spark is the partition: Spark won't use more executors than there are data partitions.

To fix this, either split your data in sample_multiclass_classification_data1.txt into multiple files, or repartition it:

num_partitions = 32
data = spark.read.format("libsvm")\
    .load("C:/spark/spark-2.3.2-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data1.txt").repartition(num_partitions)

Related question: Determining optimal number of Spark partitions based on workers, cores and DataFrame size
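A common rule of thumb from discussions like the one linked above is 2-4 partitions per available core. A hypothetical helper (name and default factor are my own, not from the answer) sketching that arithmetic for the cluster in the question:

```python
def suggested_partitions(num_workers: int, cores_per_worker: int, factor: int = 2) -> int:
    """Rule-of-thumb partition count: a small multiple of the total core
    count, so every core has work and uneven tasks can be rebalanced."""
    return num_workers * cores_per_worker * factor

# The cluster in the question: 4 workers x 4 cores = 16 cores
print(suggested_partitions(4, 4))  # 32, matching num_partitions in the answer
```

Too few partitions leaves cores idle; far too many adds scheduling and shuffle overhead, so a small multiple of the core count is the usual starting point.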

Denis Makarenko
  • Denis Makarenko, thank you for your answer. I used the repartitioning you suggested, but it is repartitioning all my data onto a single executor. I want to distribute those partitions across all executors; I have 4 executors, one on each worker. – GTR TOGTOKH Feb 27 '19 at 06:47