I'm writing a research thesis and I want to better understand how Apache Spark works, in particular how driver and executor instances behave.
Currently I'm extracting cluster resource consumption metrics using Graphite and Grafana, and I'm analyzing 3 Python programs that execute slightly different relational analyses over 3 input files of 400 MB, 800 MB, and 2 GB.
I have a local cluster, virtualized through Docker, made of 3 nodes (1 driver and 2 workers, each with the default 1 GB of RAM) running in Standalone mode.
First of all, I would like to understand whether the heap allocated to each JVM (each executor instance, I think) is dynamic. What happens if an executor runs out of memory? Will the JVM automatically allocate more RAM if it is available, or will it simply throw a java.lang.OutOfMemoryError: Java heap space error?
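For context, my current understanding is that the executor heap is fixed once at launch via spark.executor.memory (which maps to the JVM's -Xmx). The sketch below shows where I set it; the app name, master URL, and values are placeholders, not my exact setup:

```python
from pyspark.sql import SparkSession

# Sketch with placeholder values: spark.executor.memory sets the executor
# JVM heap (-Xmx) once at launch; my question is whether it can grow beyond this.
spark = (
    SparkSession.builder
    .appName("resource-metrics-test")          # placeholder app name
    .master("spark://spark-master:7077")       # placeholder Standalone master URL
    .config("spark.executor.memory", "1g")     # heap per executor JVM
    .config("spark.driver.memory", "1g")       # heap for the driver JVM
    .getOrCreate()
)
```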
Secondly, I've noticed that in the third program, which analyzes the 2 GB CSV input file, the driver shuts down an executor and starts a new one after a while. What could be the reason?
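In case it matters, the third program is roughly of this shape (a simplified sketch, not the exact code; the path and column name are placeholders):

```python
from pyspark.sql import functions as F

# Simplified sketch of the relational analysis on the 2 GB file
# (path and column name are placeholders, not the real ones).
df = spark.read.csv("data/input_2gb.csv", header=True, inferSchema=True)
aggregated = df.groupBy("key_column").agg(F.count("*").alias("rows"))
aggregated.write.mode("overwrite").csv("output/third_program/")
```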
Thirdly, is it up to the driver or to the cluster manager to shut down executor instances?
Fourthly, is there some sort of caching in Apache Spark? I've noticed that the first time I submit a Spark app it takes longer than the following executions.
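To be clear, I'm aware of the explicit, per-application caching API (sketch below); what puzzles me is a speed-up across separate spark-submit runs, which as far as I know that API shouldn't explain:

```python
# Explicit caching within a single application: the cached data lives in
# executor memory and is discarded when the application ends, so it should
# not survive across separate spark-submit invocations.
df = spark.read.csv("data/input_400mb.csv", header=True)  # placeholder path
df.cache()     # mark the DataFrame for in-memory storage
df.count()     # first action materializes the cache
df.count()     # subsequent actions read from the cache
```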
I would like to understand this aspect better, but I haven't found anything about it online.