I'm a newbie to Spark. I am using Python (PySpark) to write my program. I use the groupByKey
function to transform key-value pairs into key-(list of values) pairs. I am running Spark on a 64-core machine and I try to utilize all 64 cores by starting the program with the following command.
spark-submit --master local[64] my_program.py
However, I notice that while the groupByKey
function is executing, only one core is being used. The data is quite big, so why doesn't Spark partition it into 64 partitions and do the reduction/grouping on 64 different cores?
Am I missing some important step for parallelization?
The relevant part of the code looks like this:
# Here, input is itself an RDD of key-(list of values) pairs. The mapPartitions
# function is used to turn it into key-value pairs (variable x), from
# which another key-(list of values) RDD is created (variable y)
x = input.mapPartitions(transFunc)
# x contains key-value pairs, such as [(k1, v1), (k1, v2), (k2, v3)]
y = x.groupByKey()
# y contains key-(list of values) pairs, such as [(k1, [v1, v2]), (k2, [v3])]
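
Do I need to set the number of partitions explicitly? For example, something like the rough sketch below is what I'm considering; I'm not sure whether it's the right approach (64 is just my guess to match the core count):

# Check how many partitions x actually has
print(x.getNumPartitions())

# Explicitly request 64 partitions for the grouping step
y = x.groupByKey(numPartitions=64)

# Or repartition x before grouping
y = x.repartition(64).groupByKey()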