I have 55 GB of data that needs to be processed. I'm running spark-shell on a single machine with 32 cores and 180 GB RAM (no cluster). Since it's a single node, both the driver and the workers reside in the same JVM process and by default use 514 MB.
I set spark.driver.memory to 170G
spark-shell --driver-memory 170g
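To confirm the setting actually takes effect, I check the configuration from inside the shell (sc is the SparkContext that spark-shell predefines; the comments show what I expect to see, not verified output):

// Confirm which settings are actually in effect inside spark-shell
sc.getConf.get("spark.driver.memory")   // I expect "170g" here if the flag was picked up
sc.getConf.get("spark.master")          // "local[*]" unless overridden
sc.defaultParallelism                   // number of task slots Spark will use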
I'm doing a map operation followed by a groupBy, then agg, and a write to a Parquet file. And it's still stuck at ...
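For reference, the job is roughly shaped like the sketch below; the paths, the column names key/value, and the assumption that the input is Parquet are placeholders, not my real schema:

import org.apache.spark.sql.functions._

// Rough shape of the job; paths and column names are placeholders
val df = spark.read.parquet("/data/input")             // ~55 GB of input
val mapped = df.withColumn("value", col("value") * 2)  // stand-in for the map step
val result = mapped.groupBy("key").agg(sum("value").as("total"))
result.write.mode("overwrite").parquet("/data/output")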
Is there any way to optimize performance by changing spark.executor.memory or the number of cores used, instead of running with --master local[*]? How can one determine the optimal settings for a given task and data size? Exactly which values should I be tweaking, via --conf flags or spark-defaults.conf?
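For example, one of the values I'm unsure about is the shuffle partition count, which at least can be changed from inside the shell (64 below is an arbitrary guess for a single 32-core machine, not a recommendation):

// Runtime knob reachable from inside spark-shell; 64 is an unvalidated guess
spark.conf.set("spark.sql.shuffle.partitions", "64")   // default is 200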
In short, how do I force Spark to use all the available resources in the best possible way?