Spark has many configurable options. Here, I would like to know what the optimal configuration is under certain constraints.
I have seen many posts on this topic and do not think that an approach which neglects the structure of the data can yield a satisfactory solution.
Cluster Config
We will set the already established `--executor-cores 5`, based on the previous research. Let us add another constraint: `--executor-memory 60 GB` is the maximum threshold. This may be expressed as `--executor-memory = min(60 GB, EM)`.
We fix the number of nodes in our cluster to N_0, which implicitly regulates `--num-executors` (equal to N_0 * average num-cores per node / 5).
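As a minimal sketch of these constraints, the same settings can be expressed programmatically via `SparkConf` instead of `spark-submit` flags. The node count, cores per node, and memory estimate `EM` below are hypothetical placeholders, not values from the question:

```scala
import org.apache.spark.SparkConf

// Hypothetical cluster parameters (placeholders, not from the question):
val n0           = 10   // N_0: number of nodes in the cluster
val coresPerNode = 20   // average num-cores available per node
val em           = 48   // EM: per-executor memory estimate in GB

// Derived values following the constraints above
val executorCores = 5
val executorMemGb = math.min(60, em)                  // --executor-memory = min(60 GB, EM)
val numExecutors  = n0 * coresPerNode / executorCores // --num-executors = N_0 * cores-per-node / 5

val conf = new SparkConf()
  .setAppName("config-sketch")
  .set("spark.executor.cores", executorCores.toString)
  .set("spark.executor.memory", s"${executorMemGb}g")
  .set("spark.executor.instances", numExecutors.toString)
```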
Data Config
We are presented with data in the form of FN_0-many text files of equal size FS (approx. 1 GB) loaded into an RDD. This RDD initially has a partition number PN equal to FN_0. Loading all the files into the RDD yields RN = RDD.count() records.
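For concreteness, this is roughly how the data would be loaded and how PN and RN can be inspected; the input path is an assumption:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("data-config-sketch"))

// Hypothetical input directory containing the FN_0 text files of ~1 GB each
val rdd = sc.textFile("hdfs:///data/input/*.txt")

val pn = rdd.getNumPartitions // PN: initial partition count (assumed equal to FN_0 above)
val rn = rdd.count()          // RN: total number of records across all files
println(s"PN = $pn, RN = $rn")
```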
Question
I would like to find a qualitative expression or optimal solution for `--executor-memory`, `--num-executors`, and the partition number PN for an Input -> Map -> Filter -> Action job, in terms of N_0, FN_0, FS, RN. What is their inter-dependency?
My assumption is that the partition number would ideally be RN (approx. 100,000), so that every record has its own task, but the shuffle cost would scale astronomically. I would also appreciate any thoughts on the relationship between the product FN_0 * FS and `--executor-memory`.
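For reference, this is the shape of the job in question, with PN made explicit via `repartition` (which itself triggers a shuffle); the partition value, the map function, and the filter predicate are placeholders, not part of the actual workload:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("job-shape-sketch"))

// Hypothetical placeholder for the partition number PN under discussion
val pn = 1000

val result = sc
  .textFile("hdfs:///data/input/*.txt") // Input: FN_0 files of size FS
  .repartition(pn)                      // PN: the partition number whose optimum is in question
  .map(line => line.toUpperCase)        // Map: placeholder record-level transformation
  .filter(_.nonEmpty)                   // Filter: placeholder predicate
  .count()                              // Action: triggers the job

println(s"records after filter: $result")
```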