
I have a Spark dataframe where one of the columns (called features) is a struct type, specifically:

struct<type:tinyint,size:int,indices:array<int>,values:array<double>>

When I do df.printSchema(), this is what I get:

root
 |-- features: vector (nullable = true)

What I would like to do is have the values of the above struct in a separate column.

I have tried:

df.select("features.values").show()

But then I get the error:

AnalysisException: Can't extract value from features#125369: need struct type but got struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;

Which I don't understand, especially the part where it says need struct type but got struct (??). Can someone help me with this?

vdvaxel

1 Answer


You may need to convert the vector to an array first:

from pyspark.ml.functions import vector_to_array

df2 = df.select(vector_to_array("features").alias("features"))

and then select the appropriate columns.
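For a bit more detail, here is a minimal, self-contained sketch (assuming Spark 3.0+, where pyspark.ml.functions.vector_to_array is available; the toy data and column names below are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.functions import vector_to_array

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real data: a single vector column called "features".
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.5, 2.0]),),
     (Vectors.sparse(3, [0, 2], [1.0, 3.0]),)],
    ["features"],
)

# vector_to_array turns the vector UDT into a plain array<double> column,
# which can then be indexed like any ordinary array.
arr_df = df.select(vector_to_array("features").alias("features_arr"))

# e.g. split the array into one column per element
arr_df.select(
    arr_df.features_arr[0].alias("f0"),
    arr_df.features_arr[1].alias("f1"),
    arr_df.features_arr[2].alias("f2"),
).show()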

mck
  • Hi @mck, thanks for your answer. I tried this function, but I seem to be getting a different array than the one in the `values` array of the `features` column. Here's an example of 1 row: the `values` array of `features` contains this: `[1, 0.5, 1, 1, 1, 1, 1, 1, 0.5, 1]` but the "new" features (after applying `vector_to_array`) contains: `[1, 0, 0.5, 0, 0, 0, 0, 0, 0, 0]`. Do you know why this is not the same array? I can't seem to find any good documentation on this function. – vdvaxel Dec 08 '20 at 09:10
  • @vdvaxel that’s strange, but I can’t say anything without seeing some code – mck Dec 08 '20 at 10:23
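A possible explanation for the discrepancy raised in the comments (an assumption, since the original data isn't shown): if features holds SparseVectors, the values field of the underlying struct lists only the non-zero entries, paired with indices, whereas vector_to_array returns the dense representation with those values placed at their index positions. A small sketch of the difference:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.functions import vector_to_array

spark = SparkSession.builder.getOrCreate()

# Hypothetical sparse vector of size 10 with non-zero entries at positions 0 and 2.
sv = Vectors.sparse(10, [0, 2], [1.0, 0.5])

df = spark.createDataFrame([(sv,)], ["features"])

# The struct's `values` field lists only the non-zero entries: [1.0, 0.5],
# while vector_to_array returns the dense form: [1.0, 0.0, 0.5, 0.0, ..., 0.0].
df.select(vector_to_array("features").alias("dense")).show(truncate=False)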