I am running XGBoost on Spark 2.2/Scala in cluster mode using spark-submit on Mesos.
I want XGBoost to run in parallel, so I have set tree_method=approx, since the XGBoost Parameters for R documentation says:
Distributed and external memory version only support approximate algorithm
It is not clear to me whether this also holds for the Spark version of XGBoost.
I have set XGBoost to use 100 workers (nWorkers=100), passed --total-executor-cores 100 to spark-submit, and set .config("spark.executor.cores", 5) in my SparkSession setup. In total I then have 20 executors + 1 driver, each executor with 5 CPUs.
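To make this concrete, here is a stripped-down sketch of how I launch training. It assumes the 0.7-era xgboost4j-spark API (XGBoost.trainWithDataFrame, whose exact signature varies between versions); the input path, objective, and number of rounds are placeholders, not my real values:

```scala
import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.spark.XGBoost

// Submitted via: spark-submit --total-executor-cores 100 ... (Mesos, cluster mode)
val spark = SparkSession.builder()
  .appName("xgboost-on-mesos")
  .config("spark.executor.cores", 5) // 100 total cores / 5 per executor => 20 executors
  .getOrCreate()

// Hypothetical input; my real DataFrame has "features" and "label" columns.
val trainingData = spark.read.parquet("/path/to/train")

val params = Map(
  "tree_method" -> "approx",          // approximate algorithm, per the docs quote above
  "objective"   -> "binary:logistic"  // placeholder; my actual objective differs
)

// nWorkers = 100 launches one Spark task per XGBoost worker, which is
// presumably where the "(0 + 100) / 100" progress lines in Stage 3 come from.
val model = XGBoost.trainWithDataFrame(
  trainingData,
  params,
  round = 100, // placeholder number of boosting rounds
  nWorkers = 100,
  useExternalMemory = false
)
```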
In Stage 3, XGBoost starts to run. I see this in stdout:
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=192.168.50.105, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=100}
In stderr I see:
[Stage 3:> (0 + 100) / 100]
[Stage 3:> (0 + 100) / 100]
[Stage 3:> (0 + 100) / 100]
(repeated many times)
In the Spark UI Executors tab I see that all 20 executors run 5 tasks each.
In the Spark UI Stages tab, I see 100 tasks running at once.
In the Mesos UI Agents tab I see the job running on each node and executor, but the CPUs are never all busy. For each executor, CPU utilisation reaches maybe 1.5 at most, instead of the 5 requested. It looks like my resources are heavily under-utilised.
XGBoost stays in this state for about 30 minutes, producing many [Stage 3:> ... (0 + 100) / 100] messages, and then all at once the tasks finish and the model is ready.
I saw the post Spark ML gradient boosted trees not using all nodes, but it does not help much.
I also saw these four articles about parallelisation of the XGBoost training phase:
- How does XGBoost do parallel computation?
- Parallel Computation with R and XGBoost
- Parallel Gradient Boosting Decision Trees
- Parallelism in XGBoost machine learning technique
Questions:
- What is XGBoost doing in Stage 3?
- Is it really parallelised?
- Why isn't it using all 5 CPUs on each executor?
- How can I optimise resource usage during training?
- From the XGBoost documentation it is not clear whether XGBoost parallelises over columns, as in article 3, or over rows with histograms, as in article 2.
- Is there a shuffle involved in XGBoost training?