
I have 55 GB of data that needs to be processed. I'm running spark-shell on a single machine with 32 cores and 180 GB RAM (no cluster). Since it's a single node, both the driver and the executors reside in the same JVM process, which by default uses 512 MB.

I set spark.driver.memory to 170g:

spark-shell --driver-memory 170g
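
As a quick sanity check (run inside spark-shell; this only verifies that the flag was picked up, it does not tune anything):

// Confirm the driver memory setting took effect; should print "170g".
println(sc.getConf.get("spark.driver.memory"))

// Actual maximum heap of the single local JVM, in GiB.
println(Runtime.getRuntime.maxMemory / math.pow(1024, 3))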

I'm doing a map operation followed by a groupBy, then an agg, and finally a write to a Parquet file. The job still appears stuck (screenshot of the Spark progress not included).
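
For context, a minimal sketch of the kind of pipeline described above (the input path, the column names, and the sum() aggregation are placeholders, not the actual job):

import org.apache.spark.sql.functions.sum   // already in scope in spark-shell; shown for completeness
import spark.implicits._

// Read the raw data; path and schema handling are placeholders.
val df = spark.read.option("header", "true").csv("/path/to/input")

// Stand-in for the map step: cast a value column to double.
val mapped = df.withColumn("value", $"value".cast("double"))

// groupBy + agg, then write the result as Parquet.
mapped
  .groupBy("key")
  .agg(sum("value").alias("total"))
  .write
  .mode("overwrite")
  .parquet("/path/to/output")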

Is there any way to optimize performance by changing spark.executor.memory or the number of cores used, instead of just relying on --master local[*]? How can one determine the optimal settings for a given task and data size? Which values exactly should I be tweaking via --conf?

In short, how can I force Spark to use all of the available resources in the best possible way?
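
For reference, some of these knobs can be inspected and adjusted from inside the running shell; this is only a sketch with placeholder numbers, not a tuned recommendation:

// local[*] already schedules tasks on all 32 cores; defaultParallelism confirms it.
println(sc.defaultParallelism)

// The number of shuffle partitions used by groupBy/agg defaults to 200; matching it
// roughly to the available cores is a common starting point (64 is a placeholder).
spark.conf.set("spark.sql.shuffle.partitions", "64")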

  • You can check the CPU and memory usage in `top` or something like that. I think Spark by default uses all the cores available. The process might be slow because of slow I/O while reading that large chunk of data, and/or because of the sheer amount of computation involved. – mck Nov 07 '20 at 10:14

1 Answer


Changing spark.executor.memory doesn't take effect if you're running on a single computer, because the executor runs inside the same JVM as the driver. You need an actual cluster. Adding more nodes to the cluster spreads the partitions across more executors so they are processed in parallel, which speeds up processing.
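
For illustration only, assuming a standalone cluster with a master at the hypothetical address spark://master-host:7077, the executor settings the answer refers to would be supplied roughly like this:

import org.apache.spark.sql.SparkSession

// spark://master-host:7077 and the memory/core values are placeholders.
// Executor settings only become meaningful once a real cluster manager is used.
val spark = SparkSession.builder()
  .appName("example")
  .master("spark://master-host:7077")
  .config("spark.executor.memory", "16g")   // per-executor heap (placeholder value)
  .config("spark.executor.cores", "4")      // cores per executor (placeholder value)
  .getOrCreate()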
