
I have just launched an Amazon Elastic MapReduce (EMR) cluster after hitting java.lang.OutOfMemoryError: Java heap space while fetching 120 million rows from a database in PySpark. I have 1 master and 2 slave nodes running, each with 4 cores and 8 GB RAM.

I am trying to load a massive dataset from a MySQL database (approx. 120M rows). The query loads fine, but when I do a df.show() or try to perform operations on the Spark DataFrame, I get errors like:

  1. org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
  2. Task 0 in stage 0.0 failed 1 times; aborting job
  3. java.lang.OutOfMemoryError: GC overhead limit exceeded

My questions are:

  1. When I SSH into the Amazon EMR server and do htop, I see that 5GB out of 8GB is already in use. Why is this?
  2. On the Amazon EMR portal, I can see that the master and slave servers are running. I'm not sure if the slave servers are being used or if it's just the master doing all the work. Do I have to separately launch or "start" the 2 slave nodes, or does Spark do that automatically? If so, how do I do this?
ouila
  • Share more details on your code, and also how you are submitting it: spark-submit or something else? You can also get more details in the YARN UI. – Rahul Jun 16 '20 at 07:59
  • I'm not using spark-submit. Manually running each line of my code. Does using spark-submit make a difference? – ouila Jun 16 '20 at 08:06
  • It will allow you to submit the code to the cluster. You need to share your code here; I think the context is not getting initialized. – Rahul Jun 16 '20 at 08:08
  • There is a Zeppelin service in EMR, try that. – Rahul Jun 16 '20 at 08:08
  • Will make an edit. Added an edit in the description. @Rahul – ouila Jun 16 '20 at 08:09
  • Please use this documentation; you are not configuring any parameters to run Spark in cluster mode: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql – Rahul Jun 16 '20 at 08:21
  • @Rahul I don’t think you can run spark-shell or pyspark-shell in cluster mode. Using Zeppelin may resolve this issue. – Snigdhajyoti Jun 16 '20 at 08:24
  • It is not about spark-shell, more about the code; it has to be run using spark-submit. – Rahul Jun 16 '20 at 08:46
  • Best to use Zeppelin rather than the CLI. – Rahul Jun 16 '20 at 08:47
  • @Rahul, are you saying that manually running each line of code in the terminal is not utilising the cluster and is instead running on just the master? – ouila Jun 16 '20 at 10:13
  • @ouila Make a Python file and run it using the spark-submit command; in this scenario the master is being used as local. Also, you need to use SparkSession. – Rahul Jun 16 '20 at 10:23

1 Answer


If you are running Spark in local mode (local[*]) from the master, then it will only use the master node.
How are you submitting the Spark job?
Use YARN cluster or client mode when submitting the job so that cluster resources are used efficiently.
Read more on YARN cluster vs client mode.
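For example, instead of pasting lines into the pyspark shell, put the code in a file and submit it to YARN. A minimal sketch (the file name, app name, and executor sizes below are placeholders, not tuned values):

    # Submit from the master node, e.g.:
    #   spark-submit --master yarn --deploy-mode client \
    #       --num-executors 4 --executor-cores 2 --executor-memory 4g job.py
    from pyspark.sql import SparkSession

    # With --master yarn on the submit command there is no hard-coded
    # local[*] master in the code itself.
    spark = SparkSession.builder.appName("mysql-load").getOrCreate()

    # ... build and process your DataFrame here ...

    spark.stop()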

The master node runs all the other services like Hive, MySQL, etc. Those services may be taking that 5 GB of RAM if you aren't using standalone mode.

In the YARN UI (http://<master-public-dns>:8088) you can check in more detail what other containers are running.
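If you prefer scripting over the browser, the same information is available from the YARN ResourceManager REST API. A rough sketch (the hostname is a placeholder):

    import requests

    # YARN ResourceManager REST API on the EMR master node
    RM = "http://<master-public-dns>:8088/ws/v1/cluster"

    data = requests.get(f"{RM}/apps", params={"states": "RUNNING"}).json()
    for app in (data.get("apps") or {}).get("app", []):
        # allocatedMB / allocatedVCores show how much of the cluster each app holds
        print(app["id"], app["name"], app["allocatedMB"], "MB", app["allocatedVCores"], "vcores")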

You can check where your Spark driver and executors are running in the Spark UI (http://<master-public-dns>:18080). Select your job and go to the Executors section; there you will find the machine IP of each executor.
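The same executor list is exposed by Spark's monitoring REST API on the history server, if you'd rather script it. A rough sketch (the hostname is a placeholder, and this just takes the first application returned):

    import requests

    # Spark history server REST API
    BASE = "http://<master-public-dns>:18080/api/v1"

    apps = requests.get(f"{BASE}/applications").json()
    app_id = apps[0]["id"]  # most recent application

    # hostPort tells you which machine each executor (and the driver) ran on
    for ex in requests.get(f"{BASE}/applications/{app_id}/executors").json():
        print(ex["id"], ex["hostPort"], ex["memoryUsed"])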

Enable Ganglia in EMR, or go to the CloudWatch EC2 metrics, to check each machine's utilization.
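For CloudWatch, a quick boto3 sketch that pulls one instance's CPU utilization (the region and instance id are placeholders):

    import boto3
    from datetime import datetime, timedelta, timezone

    cw = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average"],
    )
    for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
        print(dp["Timestamp"], round(dp["Average"], 1), "%")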

Spark doesn't start or terminate nodes.
If you want to scale your cluster depending on job load, apply an autoscaling policy to the CORE or TASK instance group.
But you always need at least 1 CORE node running.
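If you go that route, a custom autoscaling policy can also be attached with boto3. A rough sketch that scales out on low available YARN memory (the cluster id, instance-group id, capacities, and threshold are all placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

    emr.put_auto_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",          # placeholder cluster id
        InstanceGroupId="ig-XXXXXXXXXXXXX",   # placeholder TASK (or CORE) group id
        AutoScalingPolicy={
            "Constraints": {"MinCapacity": 1, "MaxCapacity": 4},
            "Rules": [
                {
                    "Name": "ScaleOutOnLowYarnMemory",
                    "Action": {
                        "SimpleScalingPolicyConfiguration": {
                            "AdjustmentType": "CHANGE_IN_CAPACITY",
                            "ScalingAdjustment": 1,
                            "CoolDown": 300,
                        }
                    },
                    "Trigger": {
                        "CloudWatchAlarmDefinition": {
                            "ComparisonOperator": "LESS_THAN",
                            "EvaluationPeriods": 1,
                            "MetricName": "YARNMemoryAvailablePercentage",
                            "Namespace": "AWS/ElasticMapReduce",
                            "Period": 300,
                            "Statistic": "AVERAGE",
                            "Threshold": 15.0,
                            "Unit": "PERCENT",
                        }
                    },
                }
            ],
        },
    )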

Snigdhajyoti
  • "How are you submitting spark job?" – I'm currently running each line of my code manually. – ouila Jun 16 '20 at 08:00
  • Manually via the pyspark shell, right? If you use spark-shell or the pyspark shell, your Spark driver will be created on the master node, but the other executors may run on the CORE nodes as well. In this case your driver process will use some of the master's memory. Please check in the Spark UI where the executors are running. – Snigdhajyoti Jun 16 '20 at 08:17