What about the files with smaller size than the hadoop block size: spark + machine learning

Question

My hadoop block size if 128 MB and my file is 30 MB. And my cluster on which spark is running is a 4 node cluster with total of 64 cores.

And now my task is to run a random forest or gradient boosting algorithm with paramater grid and 3-fold cross validation on top of this.

few lines of the code:

import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit, CrossValidator}
import org.apache.spark.ml.regression.GBTRegressor

val gbt_model = new GBTRegressor().setLabelCol(target_col_name).setFeaturesCol("features").setMaxIter(2).setMaxDepth(2).setMaxBins(1700)
var stages: Array[org.apache.spark.ml.PipelineStage] = index_transformers :+ assembler :+ gbt_model
val paramGrid = new ParamGridBuilder().addGrid(gbt_model.maxIter, Array(100, 200)).addGrid(gbt_model.maxDepth, Array(2, 5, 10)).build()

val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new RegressionEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(5)
val cvModel = cv.fit(df_train)

My file has around

Input: 10 discrete/ string/ char features + 2 integer feature

Output: A Integer response/ output variable

and this takes more than 4 hours to run on my cluster. What I observed is that my code runs on just 1 node with just 3 containers.

Questions:

What I can do here to make sure my code runs on all the four nodes or uses the maximum possible cores for fast computation.
What can I do in terms of partitioning my data (DataFrame in scala, and csv file on Hadoop cluster) to get speed and computation improvements

Regards,

score 1 · Answer 1 · answered Jun 02 '16 at 12:57

1

When you submit your job, you can pass the number of executors you want via the parameter --num-executors. You can also specify the number of cores and the amount of memory each executor will use, via --executor-cores and --executor-memory.

answered Jun 02 '16 at 12:57

mgaido

2,987
3
17
39

why not this: http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop ? – Abhishek Jun 02 '16 at 14:10
for two reasons: 1. the split size doesn't affect the number of executors, but the partition number; 2. the number of partitions is set in Spark, so you can choose the number of partitions despite the number of splits in files. I think you have to understand better which is the difference between partitions (the logical parallelism) and executors (the physical parallelism): more executors means that more tasks (operations on chunks of data) are executed in parallel; more partitions means smaller (and more) chunks of data. – mgaido Jun 02 '16 at 14:48
Hey @mark91 thank you so much. So How should I choose the number of partition in spark ? And if I got you right you mean to say that all I need is more executors to fix my problem? – Abhishek Jun 02 '16 at 19:23
Yes!You can surf a bot over the internet, there are many best practices, but a good starting point is to use all your resources (cores and memory) and then tune with a few experiments. – mgaido Jun 03 '16 at 07:31

What about the files with smaller size than the hadoop block size: spark + machine learning

1 Answers1