How to extract a value from a Vector in a column of a Spark Dataframe

Question

When using SparkML to predict labels the result Dataframe is:

scala> result.show
+-----------+--------------+
|probability|predictedLabel|
+-----------+--------------+
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.1,0.9]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.1,0.9]|           0.0|
|  [0.6,0.4]|           1.0|
|  [0.6,0.4]|           1.0|
|  [1.0,0.0]|           1.0|
|  [0.9,0.1]|           1.0|
|  [0.9,0.1]|           1.0|
|  [1.0,0.0]|           1.0|
|  [1.0,0.0]|           1.0|
+-----------+--------------+
only showing top 20 rows

I want to create a new Dataframe with a new column named prob which is the first value from the Vector in probability column of original Dataframe e.g.:

+-----------+--------------+----------+
|probability|predictedLabel|   prob   |
+-----------+--------------+----------+
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.1|
|  [0.6,0.4]|           1.0|       0.6|
|  [0.6,0.4]|           1.0|       0.6|
|  [1.0,0.0]|           1.0|       1.0|
|  [0.9,0.1]|           1.0|       0.9|
|  [0.9,0.1]|           1.0|       0.9|
|  [1.0,0.0]|           1.0|       1.0|
|  [1.0,0.0]|           1.0|       1.0|
+-----------+--------------+----------+

How can extract this value into a new column?

import org.apache.spark.ml.functions.vector_to_array and then model.withColumn("prob", vector_to_array(col("probability")).getItem(1)).show(false)) — Dariusz Krynicki, Aug 20 '21 at 12:31

score 7 · Answer 1 · answered May 02 '17 at 12:43

7

You can use the capabilities of Dataset and the wonderful functions library to accomplish what you need:

result.withColumn("prob", $"probability".getItem(0))

This adds a new Column called prob whose value is derived from the probability Column by taking the first item (at index 0--we are computer scientists after all) in the array.

I would mention also that UDFs should be your last resort because the Catalyst optimizer cannot currently optimize UDFs, so you should always prefer the built-in functions to get the most out of Catalyst.

answered May 02 '17 at 12:43

Vidya

29,932
7
42
70

10

This would work with `ArrayType` not with `org.apache.spark.ml.linalg.VectorUDT`. – zero323 Nov 07 '18 at 22:01
2

Name: org.apache.spark.sql.AnalysisException Message: Can't extract value from probability#16416; StackTrace: at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) – rishiehari Aug 28 '19 at 02:50

score 1 · Answer 2 · answered May 02 '17 at 06:27

It is fairly simple if you use Spark UDF(s). Like this:

val headValue = udf((arr: Seq[Double]) => arr.head)

result.withColumn("prob", headValue(result("probability"))).show

It will give you desired output:

+-----------+--------------+----------+
|probability|predictedLabel|   prob   |
+-----------+--------------+----------+
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.1|
|  [0.6,0.4]|           1.0|       0.6|
|  [0.6,0.4]|           1.0|       0.6|
|  [1.0,0.0]|           1.0|       1.0|
|  [0.9,0.1]|           1.0|       0.9|
|  [0.9,0.1]|           1.0|       0.9|
|  [1.0,0.0]|           1.0|       1.0|
|  [1.0,0.0]|           1.0|       1.0|
+-----------+--------------+----------+

`probability` output is `org.apache.spark.ml.linalg.VectorUDT` not `ArrayType(DoubleType)`. — zero323, Nov 07 '18 at 22:02

How to extract a value from a Vector in a column of a Spark Dataframe

2 Answers2