
I have an EMR cluster with one master node and four worker nodes, each with 4 cores, 16 GB of RAM and four 80 GB hard disks. My problem is that whenever I try to train a tree-based model such as a Decision Tree, Random Forest or Gradient Boosting, the training never completes and YARN/Spark fails with an out-of-disk-space error.

Here is the code that I tried to run:

from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rfr = DecisionTreeRegressor(featuresCol="features", labelCol="age")

cvModel = rfr.fit(age_training_data) 

The process always gets stuck at the "MapPartitionsRDD map at BaggedPoint.scala" step. I ran the df -h command to examine the disk space. Before the process started, only 1 or 2% of the space was used, but once training began, usage hit 100% on two of the hard disks. This happens not only with Decision Tree, but also with Random Forest and Gradient Boosting. I was able to run Logistic Regression on the same dataset without any problem.
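
For reference, a minimal sketch of how the scratch-directory setting can be checked from the driver (this assumes the usual YARN behaviour where the NodeManager's yarn.nodemanager.local-dirs is what the executors actually use, so the value printed here is only the Spark-side default):

# Minimal sketch: print the scratch-directory setting visible to Spark.
# On YARN the directories actually used come from yarn.nodemanager.local-dirs
# (exposed to executors via LOCAL_DIRS), so this only shows the Spark-side value.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

# "/tmp" is Spark's documented fallback when spark.local.dir is not set.
print("spark.local.dir:", conf.get("spark.local.dir", "/tmp"))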

It would be highly appreciated if anyone knows the possible cause of this. Please take a look at the image of the stage at which the error happens. Thank you!

[Image: Stage Failure]

UPDATE: Here are the settings in spark-defaults.conf:

spark.history.fs.cleaner.enabled true
spark.eventLog.enabled           true
spark.driver.extraLibraryPath    /usr/lib/hadoop-current/lib/native
spark.executor.extraLibraryPath  /usr/lib/hadoop-current/lib/native
spark.driver.extraJavaOptions    -Dlog4j.ignoreTCL=true
spark.executor.extraJavaOptions  -Dlog4j.ignoreTCL=true
spark.hadoop.yarn.timeline-service.enabled  false
spark.driver.memory                 8g
spark.yarn.driver.memoryOverhead    7g
spark.driver.cores                  3
spark.executor.memory               10g
spark.yarn.executor.memoryOverhead  2048m
spark.executor.instances             4
spark.executor.cores                 2
spark.default.parallelism           48
spark.yarn.max.executor.failures    32
spark.network.timeout               100000000s
spark.rpc.askTimeout                10000000s
spark.executor.heartbeatInterval    100000000s
spark.ui.view.acls *
#spark.serializer                    org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions     -XX:+UseG1GC
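
For completeness, a minimal sketch of how the effective settings can be confirmed from inside the session, since anything in spark-defaults.conf may be overridden at spark-submit time (the key prefixes below are just examples):

# Minimal sketch: list the Spark properties actually in effect for this
# session, which may differ from spark-defaults.conf if overridden at
# spark-submit time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key.startswith(("spark.driver", "spark.executor", "spark.default")):
        print(key, "=", value)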

As for the data, it has around 120,000,000 records, and the estimated size of the RDD is 1,925,055,624 bytes (I am not sure how accurate that is, as I followed the approach described here to get the estimate).
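
In case it helps, a minimal sketch of the kind of SizeEstimator-based estimate I believe that approach describes (this is an assumption about the linked answer, and whether the returned number reflects the real size of the data is exactly what I am unsure about):

# Minimal sketch (assumption): estimate the RDD's size via the JVM-side
# SizeEstimator utility, accessed through py4j.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# age_training_data is the same training DataFrame used above.
size_estimator = spark.sparkContext._jvm.org.apache.spark.util.SizeEstimator
estimated_bytes = size_estimator.estimate(age_training_data.rdd._jrdd)
print("estimated size:", estimated_bytes, "bytes")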

I also found that the space is mostly used up by temporary data blocks written to the machines during the training process, and this seems to happen only when I use a tree-based model.
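
For reference, I have not changed any of the tree-specific parameters, so everything related to intermediate caching during training is at its default. A minimal sketch spelling out those defaults (the values shown are the documented PySpark defaults, listed only for context, not something I have tuned):

from pyspark.ml.regression import DecisionTreeRegressor

# Same estimator as above, with the defaults that influence intermediate
# data during training written out explicitly.
rfr = DecisionTreeRegressor(
    featuresCol="features",
    labelCol="age",
    maxDepth=5,             # default
    maxBins=32,             # default
    maxMemoryInMB=256,      # default memory budget for histogram aggregation
    cacheNodeIds=False,     # default: do not cache node IDs for each instance
    checkpointInterval=10,  # default; only used when node-ID caching is on
)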

jscu
  • Please provide details about the Spark configuration used when submitting the job, and also the data size. – Suresh Mar 28 '18 at 08:34
  • Have you checked the max size of /tmp? – jho Mar 28 '18 at 10:23
  • @jho I am sorry, do you mind telling me how to check the max size? – jscu Mar 28 '18 at 10:39
  • Just use df and look for /tmp or /tmpfs. You find the max size in column 'available'. For further explanations, have a look at https://superuser.com/questions/619324/my-tmp-folder-isnt-a-partition-but-has-fixed-size-why – jho Mar 28 '18 at 11:09
  • @jho The available column is showing 77421680. – jscu Mar 28 '18 at 12:16
  • The "ML" tag concerns the programming language so I removed it. – molbdnilo Mar 28 '18 at 14:14
  • @jscu I'm sorry, I completely missed this part: "Before the process started, only 1 or 2% of the space was used, but once training began, usage hit 100% on two of the hard disks." So, yes, your disk is full. What about the usage on the other two disks? If it is very low, perhaps repartitioning your data may help?! But I'm only guessing here. – jho Mar 28 '18 at 15:29
  • @jho Yes, usually two disks are full but the other two disks might not be used at all. I will try with increasing the number of partitions to see if it helps. – jscu Mar 28 '18 at 16:10
  • @jho It turns out that repartitioning the data with a larger number of partitions (roughly as sketched below) does not help either. Two disks hit 100% usage and the ML process cannot start. Can anyone help? – jscu Apr 03 '18 at 05:00
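
A minimal sketch of the kind of repartitioning I tried (the partition count below is only an example; I tried values larger than the configured spark.default.parallelism of 48):

from pyspark.ml.regression import DecisionTreeRegressor

# Repartition the training data before fitting; age_training_data is the
# same DataFrame used above, and 200 is only an example partition count.
rfr = DecisionTreeRegressor(featuresCol="features", labelCol="age")
repartitioned = age_training_data.repartition(200)
model = rfr.fit(repartitioned)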

0 Answers