
I have a Spark dataframe where one of the columns (called features) is a struct type, specifically:

struct<type:tinyint,size:int,indices:array<int>,values:array<double>>

When I do df.printSchema(), this is what I get:

root
 |-- features: vector (nullable = true)

What I would like to do is have the values of the above struct in a separate column.

I have tried:

df.select("features.values").show()

But then I get the error:

AnalysisException: Can't extract value from features#125369: need struct type but got struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;

Which I don't understand, especially the part where it says need struct type but got struct (??). Can someone help me with this?

vdvaxel

1 Answer


You may need to convert the vector to an array first:

from pyspark.ml.functions import vector_to_array

df2 = df.select(vector_to_array("features").alias("features"))

and then select the appropriate columns.
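For a bit more detail, here is a minimal, self-contained sketch (assuming Spark 3.0+, where pyspark.ml.functions.vector_to_array is available; the toy data and column names below are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.functions import vector_to_array

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real data: a single vector column called "features".
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.5, 2.0]),),
     (Vectors.sparse(3, [0, 2], [1.0, 3.0]),)],
    ["features"],
)

# vector_to_array turns the vector UDT into a plain array<double> column,
# which can then be indexed like any ordinary array.
arr_df = df.select(vector_to_array("features").alias("features_arr"))

# e.g. split the array into one column per element
arr_df.select(
    arr_df.features_arr[0].alias("f0"),
    arr_df.features_arr[1].alias("f1"),
    arr_df.features_arr[2].alias("f2"),
).show()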

mck
  • Hi @mck, thanks for your answer. I tried this function, but I seem to be getting a different array than the one in the `values` array of the `features` column. Here's an example of 1 row: the `values` array of `features` contains this: `[1, 0.5, 1, 1, 1, 1, 1, 1, 0.5, 1]` but the "new" features (after applying `vector_to_array`) contains: `[1, 0, 0.5, 0, 0, 0, 0, 0, 0, 0]`. Do you know why this is not the same array? I can't seem to find any good documentation on this function. – vdvaxel Dec 08 '20 at 09:10
  • @vdvaxel that’s strange, but I can’t say anything without seeing some code – mck Dec 08 '20 at 10:23
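A possible explanation for the discrepancy raised in the comments (an assumption, since the original data isn't shown): if features holds SparseVectors, the values field of the underlying struct lists only the non-zero entries, paired with indices, whereas vector_to_array returns the dense representation with those values placed at their index positions. A small sketch of the difference:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.functions import vector_to_array

spark = SparkSession.builder.getOrCreate()

# Hypothetical sparse vector of size 10 with non-zero entries at positions 0 and 2.
sv = Vectors.sparse(10, [0, 2], [1.0, 0.5])

df = spark.createDataFrame([(sv,)], ["features"])

# The struct's `values` field lists only the non-zero entries: [1.0, 0.5],
# while vector_to_array returns the dense form: [1.0, 0.0, 0.5, 0.0, ..., 0.0].
df.select(vector_to_array("features").alias("dense")).show(truncate=False)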