I run Spark in standalone mode on our cluster (100 machines, each with 16 CPU cores and 32 GB RAM). I specify SPARK_WORKER_MEMORY and SPARK_WORKER_CORES when running any application.
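For reference, this is roughly how I set them in conf/spark-env.sh on each machine (the exact values below are just an example of what I mean, not necessarily what I use):

    export SPARK_WORKER_CORES=16     # cores this worker offers to executors
    export SPARK_WORKER_MEMORY=28g   # memory this worker offers to executors (example value)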
In Spark programming, I write the program as if it were a serial program, and then the Spark framework parallelizes the tasks automatically, right?
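For example, a toy job like word count looks completely sequential to me (sc is the SparkContext; the paths are placeholders, not my real job):

    val lines  = sc.textFile("hdfs://.../input")
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://.../output")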
I encountered an OOM crash when I ran the program with SPARK_WORKER_CORES = 16. When I tried again with SPARK_WORKER_CORES = 4, the program completed successfully.
Of course, exploiting multiple threads for data parallelism requires more memory, but I don't know which functions in my Spark program are parallelized across those threads, so I don't know which function is responsible for the OOM.
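To illustrate the kind of function I suspect (a made-up stand-in, not my real code): a per-partition function that materializes the whole partition in memory. My guess is that with SPARK_WORKER_CORES = 16, up to 16 such calls could run at once on one machine, so peak memory would be roughly 16 partitions' worth instead of 4.

    // data is a toy RDD[String] standing in for my real data
    val processed = data.mapPartitions { iter =>
      val all = iter.toArray   // pulls the entire partition into memory at once
      all.sorted.iterator      // then processes it as one big in-memory array
    }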
I choose the number of RDD partitions (the degree of parallelism) by taking into account the total number of machines and the amount of memory per worker (per machine), so that each RDD partition of the data fits in memory.
After the RDDs are partitioned, a worker on each machine invokes the user-defined functions on each RDD partition to process it.
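Concretely, my setup looks roughly like this (the numbers, paths, and operations are illustrative, not my real ones):

    val totalInputGB      = 800                           // total data size
    val targetPartitionGB = 2                             // so one partition fits comfortably in memory
    val numPartitions     = totalInputGB / targetPartitionGB     // 400 partitions
    val records = sc.textFile("hdfs://.../input", numPartitions) // placeholder path
    val result  = records.map(line => line.split("\t"))   // a stand-in for my per-record UDF
    result.saveAsTextFile("hdfs://.../output")             // placeholder path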
Here is my question: how does Spark exploit the multi-core parallelism within each machine?
Which functions are parallelized by the multiple threads? In which functions should I take special care not to use too much memory?
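To make "which function" concrete, here is a toy pipeline (rdd is an RDD[String]; the operations are made up) with comments marking where my own code runs. Which of these closures end up running on 16 threads at once within one machine when SPARK_WORKER_CORES = 16?

    rdd.map(line => line.split(","))               // per-record parsing: my code
       .filter(fields => fields.length > 2)        // per-record filtering: my code
       .mapPartitions(iter => Iterator(iter.size)) // per-partition work: my code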
Thanks