
I'm writing a research thesis and I want to better understand how Apache Spark works, and how driver and executor instances work as well.

I'm extracting the cluster resource consumption metrics using Graphite and Grafana, and I'm analyzing 3 Python programs that run slightly different relational analyses over 3 input files whose sizes are 400 MB, 800 MB and 2 GB.

I have a virtualized (through Docker) local cluster made of 3 nodes (1 driver and 2 workers, each with the default 1 GB of RAM) running in Standalone mode.

First of all, I would like to understand whether the heap allocated for each JVM (each executor instance, I think) is dynamic. What happens if the executor runs out of memory? Will the JVM automatically allocate more RAM if it is available, or will it simply throw a java.lang.OutOfMemoryError: Java heap space error?

Secondly, I've noticed that in the third program, which analyzes the 2 GB CSV input file, the driver shuts down an executor and starts a new one after a while. What could be the reason? [RAM usage graph]

Thirdly, is it up to the driver or to the cluster manager to shut down executor instances?

Fourthly, is there some sort of caching in Apache Spark? I've noticed that the first time I submit a Spark app it takes longer than the following executions.

I would like to understand this aspect better, but I couldn't find anything about it online.

1 Answer


First of all, I would like to understand whether the heap allocated for each JVM (each executor instance, I think) is dynamic. What happens if the executor runs out of memory? Will the JVM automatically allocate more RAM if it is available, or will it simply throw a java.lang.OutOfMemoryError: Java heap space error?

It depends on the type of memory in the executor. Please refer to this blog: https://medium.com/swlh/spark-oom-error-closeup-462c7a01709d

Execution memory holds data that Spark processes or generates, such as intermediate results of RDD transformations; it is used for shuffles, joins and sorts. It will spill to disk if the allocated memory limit is breached. This data is short-lived.

Remember that Spark takes advantage of memory to compute, but for large data it can also spill the data to disk. For the user memory region, however, it does raise an OOM error when the limit is reached.
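Note that the executor heap itself is sized once when the executor JVM is launched; it does not grow on demand. Below is a minimal PySpark sketch of the relevant settings; the values are illustrative and the master URL is a placeholder for your standalone master:

```python
from pyspark.sql import SparkSession

# Minimal sketch: the executor heap is fixed at launch time, not grown on demand.
# Values below are illustrative, not recommendations.
spark = (
    SparkSession.builder
    .appName("memory-config-sketch")
    .master("spark://spark-master:7077")             # placeholder standalone master URL
    .config("spark.executor.memory", "1g")           # fixed JVM heap (-Xmx) per executor
    .config("spark.memory.fraction", "0.6")          # share of heap for execution + storage
    .config("spark.memory.storageFraction", "0.5")   # storage's share of that unified region
    .getOrCreate()
)
```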

Secondly, I've noticed that in the third program, which analyzes the 2 GB CSV input file, the driver shuts down an executor and starts a new one after a while. What could be the reason?

There can be various reasons why an executor is killed. You need to dig into the logs to see why this happens. You can refer to these links about how to find hints in your logs: Spark application kills executor and Why are the executors getting killed by the driver?
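As a supplement to the logs, you can also watch executors come and go through Spark's monitoring REST API on the driver UI. A small sketch, assuming the default UI port 4040 on localhost and that the application is still running:

```python
import json
from urllib.request import urlopen

# Sketch: poll Spark's monitoring REST API on the driver UI (default port 4040)
# while the application runs, to see which executors are alive and how much
# memory they report. Host and port are assumptions about your setup.
base = "http://localhost:4040/api/v1"
apps = json.load(urlopen(f"{base}/applications"))
app_id = apps[0]["id"]

for e in json.load(urlopen(f"{base}/applications/{app_id}/executors")):
    print(e["id"], e["isActive"], e["memoryUsed"], e["maxMemory"])
```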

Thirdly, is it up to the driver or to the cluster manager to shut down executor instances?

Usually it's the cluster manager that takes charge. The driver is simply the main process that runs your Spark job, and your job normally doesn't contain any instruction to allocate or kill an executor (in some cases you can do so, but it's normally not used; refer to the linked code).
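For completeness, the usual mechanism by which executors are released and re-requested is dynamic allocation, which the cluster manager carries out on the driver's behalf. A configuration sketch under the assumption of Spark 3.0+ (for the shuffle-tracking setting); the values are illustrative:

```python
from pyspark.sql import SparkSession

# Sketch: with dynamic allocation, the driver only requests executors;
# the cluster manager grants and reclaims them. Settings are illustrative.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-sketch")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "2")
    # In standalone mode, executors can only be released safely if shuffle data
    # is tracked (Spark 3.0+) or an external shuffle service is running.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```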

Fourthly, is there some sort of caching in Apache Spark? I've noticed that the first time I submit a Spark app it takes longer than the following executions.

Spark does support caching within a job. However, if you submit each job individually, Spark will not maintain a context to cache any data between them. Do your 3 jobs share one SparkSession? Do you call spark.stop() when each individual job finishes? If so, there won't be any caching. But I guess you may be using a notebook and your jobs share the same SparkSession. A good way to check why your later jobs are faster is to look at the Spark web UI: compare the jobs and you will find the reason for the differences.
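To make the difference concrete, here is a minimal sketch of caching within a single SparkSession; the file name and column name are placeholders for your own data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

# Sketch: caching only helps while the same SparkSession / application is alive.
# "input.csv" and "some_column" are placeholders for your own data.
df = spark.read.csv("input.csv", header=True, inferSchema=True)
df.cache()        # mark the DataFrame for in-memory storage
df.count()        # the first action materializes the cache (slower)
df.where(df["some_column"].isNotNull()).count()  # later actions reuse the cache (faster)

spark.stop()      # once the session stops, the cached data is gone
```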
