I am running XGBoost on Spark 2.2/Scala in cluster mode using spark-submit on Mesos.
I want XGBoost to run in parallel, so I have set tree_method=approx, since the XGBoost Parameters for R documentation says:
Distributed and external memory version only support approximate algorithm
It is not clear to me whether this also holds for the Spark version of XGBoost.
I have set XGBoost to use 100 workers (nWorkers=100), passed --total-executor-cores 100 to spark-submit, and set .config("spark.executor.cores", 5) in my SparkSession setup. In total I then have 20 executors + 1 driver, each executor with 5 CPUs.
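To make this concrete, here is a stripped-down sketch of how I launch training. It assumes the 0.7-era xgboost4j-spark API (XGBoost.trainWithDataFrame, whose exact signature varies between versions); the input path, objective, and number of rounds are placeholders, not my real values:

```scala
import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.spark.XGBoost

// Submitted via: spark-submit --total-executor-cores 100 ... (Mesos, cluster mode)
val spark = SparkSession.builder()
  .appName("xgboost-on-mesos")
  .config("spark.executor.cores", 5) // 100 total cores / 5 per executor => 20 executors
  .getOrCreate()

// Hypothetical input; my real DataFrame has "features" and "label" columns.
val trainingData = spark.read.parquet("/path/to/train")

val params = Map(
  "tree_method" -> "approx",          // approximate algorithm, per the docs quote above
  "objective"   -> "binary:logistic"  // placeholder; my actual objective differs
)

// nWorkers = 100 launches one Spark task per XGBoost worker, which is
// presumably where the "(0 + 100) / 100" progress lines in Stage 3 come from.
val model = XGBoost.trainWithDataFrame(
  trainingData,
  params,
  round = 100, // placeholder number of boosting rounds
  nWorkers = 100,
  useExternalMemory = false
)
```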
In Stage 3, XGBoost starts to run. I see this in stdout:
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=192.168.50.105, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=100}
In stderr I see:
[Stage 3:> (0 + 100) / 100]
[Stage 3:> (0 + 100) / 100]
[Stage 3:> (0 + 100) / 100]
(repeated many times)
In the Spark UI Executors tab I see that all 20 executors run 5 tasks each.
In the Spark UI Stages tab, I see 100 tasks running at once.
In the Mesos UI Agents tab I see the job running on each node and executor, but the CPUs are never all busy. For each executor, CPU utilisation reaches maybe 1.5 at most, instead of the 5 requested. It looks like my resources are heavily under-utilised.
XGBoost stays in this state for about 30 minutes, producing many [Stage 3:> ... (0 + 100) / 100] messages, and then all at once the tasks finish and the model is ready.
I saw the post Spark ML gradient boosted trees not using all nodes, but it does not help much.
I also saw these four articles about parallelisation of the XGBoost training phase:
- How does XGBoost do parallel computation?
- Parallel Computation with R and XGBoost
- Parallel Gradient Boosting Decision Trees
- Parallelism in XGBoost machine learning technique
Questions:
- What is XGBoost doing in Stage 3?
- Is it really parallelised?
- Why isn't it using all 5 CPUs on each executor?
- How can I optimise resource usage during training?
- From the XGBoost documentation it is not clear whether XGBoost parallelises over columns, as in article 3, or over rows with histograms, as in article 2.
- Is there a shuffle involved in XGBoost training?