
We are running our Spark application, written in Java, on the following hardware:

  1. one Master node
  2. two Worker nodes (each with 502.5 GB of available memory and 88 cores (CPUs)).

with the following configuration for the ./spark-submit command:

--executor-memory=30GB --driver-memory=20G --executor-cores=5 --driver-cores=5

We are using Spark's standalone cluster manager.

It takes 13 minutes to process 10 million records.

We are not at liberty to share the application code.

Can someone suggest configuration changes to tune our application for better performance?

Let me know if you need any other details.

We are using Spark 2.3.0.

EDIT

Our data contains 127 columns and 10 million rows. Spark started 32 executors with the above configuration. We are making an external application call inside a flatMap function.

Do you think the hardware resources are insufficient?

A Learner
  • How long would you _expect_ the data processing to take? Without any information about the data and the kind of processing, it's impossible to tell whether 13 minutes indicates a good or bad performance. – Roland Weber Nov 29 '18 at 10:08
  • How many executors do you start? – Roland Weber Nov 29 '18 at 10:10
  • @RolandWeber Our data contains 127 columns and 10 million rows. Spark started 32 executors with the above configuration. We are making an external application call inside a flatMap function. – A Learner Nov 29 '18 at 10:18
  • Thanks. Could you please add this information to the question itself? Others shouldn't have to read through comments to understand what you're asking about. – Roland Weber Nov 29 '18 at 10:20
  • I suppose the external application call is made through some kind of network call? If so, you'd probably be better off using flatMapPartitions instead of a simple flatMap. The reason is that if you share an HTTP/whatever connection between calls, you'll be faster opening one per partition than one per item. – GPI Nov 29 '18 at 10:48
  • @GPI Sure, will try. – A Learner Nov 29 '18 at 11:06
  • @GPI I am calling flatMap on a DataFrame; I couldn't find a flatMapPartitions method for it. – A Learner Nov 29 '18 at 11:11
  • Sorry, my bad, you should use `mapPartitions` and, inside of it, perform the flatMap logic. Kind of like what is done here: https://stackoverflow.com/a/41695377/2131074 but with your external call instead of a DB connection. If it is an HTTP call, make sure you initialize the HTTP client once and only once, that you close it properly at the end of the partition mapping, and that you effectively reuse connections (as is the case with the DB connection in the link above). – GPI Nov 29 '18 at 11:28
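A minimal sketch of the approach described in the last comment, in Java against the DataFrame (Dataset<Row>) API. The `ExternalServiceClient` class and its `call`/`close` methods are assumptions standing in for whatever the external application call actually is; the point is only that the client is opened once per partition instead of once per row:

```java
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

import java.util.ArrayList;
import java.util.List;

public class PartitionedExternalCall {

    // Replaces a per-row flatMap: one client per partition, reused for every row in it.
    public static Dataset<String> enrich(Dataset<Row> input) {
        return input.mapPartitions((MapPartitionsFunction<Row, String>) rows -> {
            // Hypothetical wrapper around the external application call (not from the question).
            ExternalServiceClient client = new ExternalServiceClient();
            List<String> results = new ArrayList<>();
            try {
                while (rows.hasNext()) {
                    Row row = rows.next();
                    // Zero or more results per row, all sharing the same connection.
                    results.addAll(client.call(row));
                }
            } finally {
                client.close(); // release the connection once the partition is processed
            }
            return results.iterator();
        }, Encoders.STRING());
    }
}
```

This sketch materializes the partition's results in a list before returning the iterator, which keeps the client-closing logic simple; for very large partitions, a lazy iterator that closes the client on exhaustion would be preferable.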

1 Answer


If you're on a Spark standalone cluster, you can try lowering the --executor-cores=5 setting so that you get more executors, provided your operation is not CPU-intensive. Also try setting --total-executor-cores to 88 (or the maximum number of cores you want to give the application; together with --executor-cores, this parameter determines the number of executors you will have) so that you have better control.
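For example, a starting point along those lines could look like the following; the specific numbers are illustrative assumptions, not measured recommendations, and should be tuned against the actual workload:

```
./spark-submit \
  --executor-memory=30G \
  --driver-memory=20G \
  --executor-cores=4 \
  --driver-cores=5 \
  --total-executor-cores=88 \
  <application-jar> [application-arguments]
```

The number of executors then follows from --total-executor-cores divided by --executor-cores (executor memory permitting), which gives finer control than letting the standalone scheduler grab every available core on the workers.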

void