My Hadoop block size is 128 MB and my file is 30 MB. The cluster on which Spark runs has 4 nodes with 64 cores in total.
My task is to run a random forest or gradient boosting algorithm with a parameter grid and 3-fold cross-validation on top of this data.
A few lines of the code:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// GBT regressor; the maxIter/maxDepth set here are overridden by the parameter grid below
val gbt_model = new GBTRegressor().setLabelCol(target_col_name).setFeaturesCol("features").setMaxIter(2).setMaxDepth(2).setMaxBins(1700)
// string indexers + feature assembler + model chained into a single pipeline
val stages: Array[org.apache.spark.ml.PipelineStage] = index_transformers :+ assembler :+ gbt_model
val pipeline = new Pipeline().setStages(stages)
// 2 x 3 = 6 parameter combinations, each fitted numFolds times by the cross-validator
val paramGrid = new ParamGridBuilder().addGrid(gbt_model.maxIter, Array(100, 200)).addGrid(gbt_model.maxDepth, Array(2, 5, 10)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new RegressionEvaluator().setLabelCol(target_col_name)).setEstimatorParamMaps(paramGrid).setNumFolds(5)
val cvModel = cv.fit(df_train)
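
For what it's worth, the only related knob I have found so far is CrossValidator.setParallelism (Spark 2.3+), which, as I understand it, lets several of the candidate models be fitted concurrently; I am not sure whether it is enough to spread the work over all nodes. A sketch, with the parallelism value being just a guess:

// Assumes Spark 2.3 or later; trains up to 4 of the candidate models at the same time
val cvParallel = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator().setLabelCol(target_col_name))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)
  .setParallelism(4)   // example value: number of models evaluated in parallel
val cvModelParallel = cvParallel.fit(df_train)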
My dataset has roughly this shape:
- Input: 10 discrete/string/character features + 2 integer features
- Output: an integer response/target variable
This takes more than 4 hours to run on my cluster. What I observed is that the job runs on just 1 node, with only 3 containers.
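
For context, these are the kinds of resource settings I can change when starting the application; the numbers below are placeholders to show what I mean, not values I have actually tuned:

import org.apache.spark.sql.SparkSession

// Placeholder resource configuration, set before the SparkContext is created
val spark = SparkSession.builder()
  .appName("gbt-cv")
  .config("spark.executor.instances", "8")   // e.g. 2 executors per node on 4 nodes (assumption)
  .config("spark.executor.cores", "8")       // cores per executor (assumption)
  .config("spark.executor.memory", "8g")     // memory per executor (assumption)
  .getOrCreate()

// The same can be passed at submit time, e.g.:
//   spark-submit --num-executors 8 --executor-cores 8 --executor-memory 8g ...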
Questions:
- What can I do here to make sure my code runs on all four nodes, or uses the maximum possible number of cores, for faster computation?
- What can I do in terms of partitioning my data (the DataFrame in Scala, and the CSV file on the Hadoop cluster) to get speed and computation improvements? (See the sketch below for what I mean.)
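
To make the second question concrete: since the file is only 30 MB and the HDFS block size is 128 MB, I assume the CSV comes in as a single partition, so the downstream work cannot spread out. A minimal sketch of the kind of repartitioning I have in mind (the path and partition count are just examples):

// Read the small CSV; a 30 MB file within one 128 MB block arrives as roughly one partition
val df_raw = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/train.csv")
println(df_raw.rdd.getNumPartitions)   // expecting 1, or very few, partitions

// Spread the rows across the cluster before the expensive cross-validated fit;
// 64 matches the total core count, but the right number is an open question
val df_train = df_raw.repartition(64).cache()
df_train.count()   // materialise the cache so the repartitioned data is reused across folds and parameter maps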
Regards,