Spark has many configurable options. Here, I would like to know what the optimal configuration is under certain constraints.
I have seen many posts on this topic and do not think that an approach which neglects the structure of the data can yield a satisfactory solution.
Cluster Config
We will set the already established `--executor-cores 5`, based on the previous research. Let us add another constraint: `--executor-memory 60 GB` is the maximum threshold. This may be expressed as `--executor-memory = min(60 GB, EM)`.
We fix the number of nodes in our cluster to N_0, which implicitly regulates `--num-executors` (equal to N_0 * average num-cores per node / 5).
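As a minimal sketch of these constraints, the same settings can be expressed programmatically via `SparkConf` instead of `spark-submit` flags. The node count, cores per node, and memory estimate `EM` below are hypothetical placeholders, not values from the question:

```scala
import org.apache.spark.SparkConf

// Hypothetical cluster parameters (placeholders, not from the question):
val n0           = 10   // N_0: number of nodes in the cluster
val coresPerNode = 20   // average num-cores available per node
val em           = 48   // EM: per-executor memory estimate in GB

// Derived values following the constraints above
val executorCores = 5
val executorMemGb = math.min(60, em)                  // --executor-memory = min(60 GB, EM)
val numExecutors  = n0 * coresPerNode / executorCores // --num-executors = N_0 * cores-per-node / 5

val conf = new SparkConf()
  .setAppName("config-sketch")
  .set("spark.executor.cores", executorCores.toString)
  .set("spark.executor.memory", s"${executorMemGb}g")
  .set("spark.executor.instances", numExecutors.toString)
```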
Data Config
We are presented with data in the form of FN_0-many text files of equal size FS (approx. 1 GB) loaded into an RDD. This RDD initially has a partition number PN equal to FN_0. Loading all the files into the RDD yields RN = RDD.count() records.
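For concreteness, this is roughly how the data would be loaded and how PN and RN can be inspected; the input path is an assumption:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("data-config-sketch"))

// Hypothetical input directory containing the FN_0 text files of ~1 GB each
val rdd = sc.textFile("hdfs:///data/input/*.txt")

val pn = rdd.getNumPartitions // PN: initial partition count (assumed equal to FN_0 above)
val rn = rdd.count()          // RN: total number of records across all files
println(s"PN = $pn, RN = $rn")
```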
Question
I would like to find a qualitative expression or optimal solution for `--executor-memory`, `--num-executors`, and the partition number PN for an Input -> Map -> Filter -> Action job, in terms of N_0, FN_0, FS, RN. What is their inter-dependency?
My assumption is that the partition number would ideally be RN (approx. 100,000), so that every record has its own task, but the shuffle cost would scale astronomically. I would also appreciate any thoughts on the relationship between the product FN_0 * FS and `--executor-memory`.
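For reference, this is the shape of the job in question, with PN made explicit via `repartition` (which itself triggers a shuffle); the partition value, the map function, and the filter predicate are placeholders, not part of the actual workload:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("job-shape-sketch"))

// Hypothetical placeholder for the partition number PN under discussion
val pn = 1000

val result = sc
  .textFile("hdfs:///data/input/*.txt") // Input: FN_0 files of size FS
  .repartition(pn)                      // PN: the partition number whose optimum is in question
  .map(line => line.toUpperCase)        // Map: placeholder record-level transformation
  .filter(_.nonEmpty)                   // Filter: placeholder predicate
  .count()                              // Action: triggers the job

println(s"records after filter: $result")
```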