I have just launched Amazon Elastic MapReduce server after trying java.lang.OutofMemorySpace:Java heap space while fetching 120 million rows from database in pyspark where I have 1 master and 2 slave nodes running each having 4 cores and 8G RAM.
I am trying to load a massive dataset from MySQL database (containing approx. 120M rows). The query loads fine but when I do a df.show()
operation or when I try to perform operations on the spark dataframe I am getting errors like -
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
Task 0 in stage 0.0 failed 1 times; aborting job
java.lang.OutOfMemoryError: GC overhead limit exceeded
My questions are -
- When I SSH into the Amazon EMR server and do
htop
, I see that 5GB out of 8GB is already in use. Why is this? - On the Amazon EMR portal, I can see that the master and slave servers are running. I'm not sure if the slave servers are being used or if its just the master doing all the work. Do I have to separately launch or "start" the 2 slave nodes or does Spark do that automatically? If yes, how do I do this?