I am training an XGBoost model as shown below and getting back a set of responses and probabilities. The probabilities come back as a vector:

%scala 
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}

val dataset = sqlContext.table("train_set")

val paramMap = List(
      "eta" -> 0.023f,
      "max_depth" -> 10,
      "base_score" -> 0.005,
      "eval_metric" -> "auc",
      "seed" -> 49,
      "objective" -> "binary:logistic").toMap

// 30 boosting rounds spread across 10 workers
val xgboostModel = XGBoost.trainWithDataFrame(dataset, paramMap, 30, 10, useExternalMemory = true)

val test_dataset = sqlContext.table("test_set")
val predictions = xgboostModel.setExternalMemory(true).transform(test_dataset).select("some_key", "probabilities")

// predictions: org.apache.spark.sql.DataFrame = [some_key: int, probabilities: vector]

/*
+--------+-------------+
|some_key|probabilities|
+--------+-------------+
|       0| [0.98,0.02] |
|       1| [0.95,0.05] |
|       2| [0.99,0.01] |
|       3| [0.82,0.18] |
+--------+-------------+
*/

I only want the second probability, not the whole vector. How can I create a new DataFrame containing just that value and some_key?

/*
+--------+-----------+
|some_key|probability|
+--------+-----------+
|       0|      0.02 |
|       1|      0.05 |
|       2|      0.01 |
|       3|      0.18 |
+--------+-----------+
*/
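
One possible approach, sketched under assumptions rather than confirmed against this exact setup: wrap the vector lookup in a UDF and select its result next to some_key. The sketch assumes the probabilities column holds org.apache.spark.ml.linalg.Vector values (on Spark 1.x the type would be org.apache.spark.mllib.linalg.Vector instead). The small `predictions` DataFrame built here is a hypothetical stand-in for the one produced by transform above, and `secondElement` is an illustrative name.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Reuses the notebook's session if one exists, otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("second-probability").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the `predictions` DataFrame from the question.
val predictions = Seq(
  (0, Vectors.dense(0.98, 0.02)),
  (1, Vectors.dense(0.95, 0.05)),
  (2, Vectors.dense(0.99, 0.01)),
  (3, Vectors.dense(0.82, 0.18))
).toDF("some_key", "probabilities")

// UDF returning the element at index 1 of the vector (the second probability).
val secondElement = udf((v: Vector) => v(1))

// Keep the key and replace the vector with the single extracted value.
val result = predictions.select(
  col("some_key"),
  secondElement(col("probabilities")).as("probability")
)

result.show()
```

On Spark 3.0 or later, `org.apache.spark.ml.functions.vector_to_array` followed by `getItem(1)` should give the same column without a custom UDF.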
