I'm training XGBoost as shown below and getting back responses and probabilities. The probabilities come back as a vector:
%scala
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}
val dataset = sqlContext.table("train_set")
val paramMap = List(
  "eta" -> 0.023f,
  "max_depth" -> 10,
  "base_score" -> 0.005,
  "eval_metric" -> "auc",
  "seed" -> 49,
  "objective" -> "binary:logistic").toMap
val xgboostModel = XGBoost.trainWithDataFrame(dataset, paramMap, 30, 10, useExternalMemory=true)
val test_dataset = sqlContext.table("test_set")
val predictions = xgboostModel.setExternalMemory(true).transform(test_dataset).select("some_key", "probabilities")
org.apache.spark.sql.DataFrame = [some_key: int, probabilities: vector]
/*
+--------+-------------+
|some_key|probabilities|
+--------+-------------+
| 0| [0.98,0.02] |
| 1| [0.95,0.05] |
| 2| [0.99,0.01] |
| 3| [0.82,0.18] |
+--------+-------------+
*/
I just want the second probability, not the whole vector. How would I create a new DataFrame with just that and the some_key? The expected output is below, followed by a sketch of the kind of thing I have in mind.
/*
+--------+-----------+
|some_key|probability|
+--------+-----------+
| 0| 0.02 |
| 1| 0.05 |
| 2| 0.01 |
| 3| 0.18 |
+--------+-----------+
*/
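Is a UDF that indexes into the vector the right way to go? A minimal sketch of what I'm picturing, assuming the probabilities column comes back as an org.apache.spark.ml.linalg.Vector (I haven't verified which vector type this xgboost4j-spark version actually returns):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// hypothetical helper: pull element 1 (the second, positive-class probability) out of the vector
val secondProb = udf((v: Vector) => v(1))

val result = predictions
  .withColumn("probability", secondProb(col("probabilities")))
  .select("some_key", "probability")

If that's on the right track, result.show() should produce something like the second table above, but I'm not sure whether a UDF is the idiomatic way to do this or if there's a built-in I'm missing.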