
I have just launched an Amazon Elastic MapReduce (EMR) cluster after hitting java.lang.OutOfMemoryError: Java heap space while fetching 120 million rows from a database in PySpark. I have 1 master and 2 slave nodes running, each with 4 cores and 8 GB RAM.

I am trying to load a massive dataset from a MySQL database (approx. 120M rows). The query loads fine, but when I do a df.show() or try to perform operations on the Spark DataFrame, I get errors like:

  1. org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
  2. Task 0 in stage 0.0 failed 1 times; aborting job
  3. java.lang.OutOfMemoryError: GC overhead limit exceeded

My questions are:

  1. When I SSH into the Amazon EMR server and do htop, I see that 5GB out of 8GB is already in use. Why is this?
  2. On the Amazon EMR portal, I can see that the master and slave servers are running. I'm not sure if the slave servers are being used or if it's just the master doing all the work. Do I have to separately launch or "start" the 2 slave nodes, or does Spark do that automatically? If so, how do I do this?
ouila
  • Share more details on your code, and also how you are submitting it: spark-submit or something else? You can also get more details in the YARN UI. – Rahul Jun 16 '20 at 07:59
  • I'm not using spark-submit. Manually running each line of my code. Does using spark-submit make a difference? – ouila Jun 16 '20 at 08:06
  • It will allow you to submit the code to the cluster. You need to share your code here; I think the context is not getting initialized. – Rahul Jun 16 '20 at 08:08
  • There is a Zeppelin service in EMR, try that. – Rahul Jun 16 '20 at 08:08
  • Will make an edit. Added an edit in the description. @Rahul – ouila Jun 16 '20 at 08:09
  • Please use this documentation; you are not configuring any parameters to run Spark in cluster mode: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql – Rahul Jun 16 '20 at 08:21
  • @Rahul I don’t think you can run spark-shell or pyspark-shell in cluster mode. Using Zeppelin may resolve this issue. – Snigdhajyoti Jun 16 '20 at 08:24
  • It is not about spark-shell, more about the code; it has to be run using spark-submit. – Rahul Jun 16 '20 at 08:46
  • Best to use Zeppelin rather than the CLI. – Rahul Jun 16 '20 at 08:47
  • @Rahul, are you saying that manually running each line of code in the terminal is not utilising the cluster and is instead running on just the master? – ouila Jun 16 '20 at 10:13
  • @ouila Make a Python file and run it using the spark-submit command; in this scenario the master is being used as local. Also, you need to use SparkSession. – Rahul Jun 16 '20 at 10:23

1 Answer


If you are running Spark in local mode (local[*]) from the master, then it will only use the master node.
How are you submitting the Spark job?
Use YARN cluster or client mode when submitting the job so that cluster resources are used efficiently.
Read more on YARN cluster vs client mode.
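For example, instead of pasting lines into the pyspark shell, put the code in a file and submit it to YARN. A minimal sketch (the file name, app name, and executor sizes below are placeholders, not tuned values):

    # Submit from the master node, e.g.:
    #   spark-submit --master yarn --deploy-mode client \
    #       --num-executors 4 --executor-cores 2 --executor-memory 4g job.py
    from pyspark.sql import SparkSession

    # With --master yarn on the submit command there is no hard-coded
    # local[*] master in the code itself.
    spark = SparkSession.builder.appName("mysql-load").getOrCreate()

    # ... build and process your DataFrame here ...

    spark.stop()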

The master node runs all the other services like Hive, MySQL, etc. Those services may be taking that 5 GB of RAM if you aren't using standalone mode.

In the YARN UI (http://<master-public-dns>:8088) you can check in more detail what other containers are running.
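If you prefer scripting over the browser, the same information is available from the YARN ResourceManager REST API. A rough sketch (the hostname is a placeholder):

    import requests

    # YARN ResourceManager REST API on the EMR master node
    RM = "http://<master-public-dns>:8088/ws/v1/cluster"

    data = requests.get(f"{RM}/apps", params={"states": "RUNNING"}).json()
    for app in (data.get("apps") or {}).get("app", []):
        # allocatedMB / allocatedVCores show how much of the cluster each app holds
        print(app["id"], app["name"], app["allocatedMB"], "MB", app["allocatedVCores"], "vcores")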

You can check where your Spark driver and executors are running in the Spark UI (http://<master-public-dns>:18080). Select your job and go to the Executors section; there you will find the machine IP of each executor.
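The same executor list is exposed by Spark's monitoring REST API on the history server, if you'd rather script it. A rough sketch (the hostname is a placeholder, and this just takes the first application returned):

    import requests

    # Spark history server REST API
    BASE = "http://<master-public-dns>:18080/api/v1"

    apps = requests.get(f"{BASE}/applications").json()
    app_id = apps[0]["id"]  # most recent application

    # hostPort tells you which machine each executor (and the driver) ran on
    for ex in requests.get(f"{BASE}/applications/{app_id}/executors").json():
        print(ex["id"], ex["hostPort"], ex["memoryUsed"])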

Enable Ganglia in EMR, or go to the CloudWatch EC2 metrics, to check each machine's utilization.
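For CloudWatch, a quick boto3 sketch that pulls one instance's CPU utilization (the region and instance id are placeholders):

    import boto3
    from datetime import datetime, timedelta, timezone

    cw = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average"],
    )
    for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
        print(dp["Timestamp"], round(dp["Average"], 1), "%")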

Spark doesn't start or terminate nodes.
If you want to scale your cluster depending on job load, apply an autoscaling policy to the CORE or TASK instance group.
But you always need at least 1 CORE node running.
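If you go that route, a custom autoscaling policy can also be attached with boto3. A rough sketch that scales out on low available YARN memory (the cluster id, instance-group id, capacities, and threshold are all placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

    emr.put_auto_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",          # placeholder cluster id
        InstanceGroupId="ig-XXXXXXXXXXXXX",   # placeholder TASK (or CORE) group id
        AutoScalingPolicy={
            "Constraints": {"MinCapacity": 1, "MaxCapacity": 4},
            "Rules": [
                {
                    "Name": "ScaleOutOnLowYarnMemory",
                    "Action": {
                        "SimpleScalingPolicyConfiguration": {
                            "AdjustmentType": "CHANGE_IN_CAPACITY",
                            "ScalingAdjustment": 1,
                            "CoolDown": 300,
                        }
                    },
                    "Trigger": {
                        "CloudWatchAlarmDefinition": {
                            "ComparisonOperator": "LESS_THAN",
                            "EvaluationPeriods": 1,
                            "MetricName": "YARNMemoryAvailablePercentage",
                            "Namespace": "AWS/ElasticMapReduce",
                            "Period": 300,
                            "Statistic": "AVERAGE",
                            "Threshold": 15.0,
                            "Unit": "PERCENT",
                        }
                    },
                }
            ],
        },
    )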

Snigdhajyoti
  • "How are you submitting spark job?" – I'm currently running each line of my code manually. – ouila Jun 16 '20 at 08:00
  • Manually via the pyspark shell, right? If you use spark-shell or the pyspark shell, your Spark driver will be created on the master node, but the other executors may run on the CORE nodes as well. In this case your driver process will use some of the master's memory. Please check in the Spark UI where the executors are running. – Snigdhajyoti Jun 16 '20 at 08:17