pyspark make new column with single Row element of other column?

Question

I trained an XGBoost classifier model in PySpark and transformed some data via

outp = model.transform(inp)

Now outp contains a column 'probability' with row entries such as

Row(probability=DenseVector([0.99,0.01]))

I'd like to add a new column to outp that contains, as a plain float, the second component of each row's probability vector (e.g. just 0.01 instead of Row(...)). What is the correct syntax to do that?

I tried

outp = outp.select("*",(col('probability')[:,1]).alias('prob'))

expecting that the second element (index 1) of each row's vector would be selected, but that syntax produces an error.

Convert the `DenseVector` to an `array` using [this solution](https://stackoverflow.com/a/58495099/8279585), then use the `getItem()` column method to extract the second element from the resulting array. - samkart, Apr 24 '23 at 15:31

score 0 · Accepted Answer · answered Apr 24 '23 at 15:56

Using the suggestion from the comment by samkart, I changed the syntax to:

from pyspark.ml.functions import vector_to_array  # available in Spark 3.0+
outp = outp.select("*", vector_to_array('probability').getItem(1).alias('prob'))

and now it does what I wanted.

user007
