I have a DataFrame with the following schema:

root
 |-- index: long (nullable = true)
 |-- text: string (nullable = true)
 |-- topicDistribution: struct (nullable = true)
 |    |-- type: long (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |-- wiki_index: string (nullable = true)

I need to change it to:

root
 |-- index: long (nullable = true)
 |-- text: string (nullable = true)
 |-- topicDistribution: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- wiki_index: string (nullable = true)

How can I do that?

Thanks a lot.

ZygD
Ippon

1 Answer

I think you're looking for

from pyspark.sql.functions import col

df = df.withColumn("topicDistribution", col("topicDistribution").getField("values"))
ayplam
  • This is an interesting use case and solution. However, the `topicDistribution` column remains of type `struct` and not `array` and I have not yet figured out how to convert between these two types. – Simon Z. Sep 07 '18 at 10:39
  • How can this be done dynamically? My `withColumn` should create all the columns dynamically, based on the names of the struct's keys. – Naveen Srikanth Nov 21 '18 at 11:55
  • I don't have code on hand, but you can do something like: 1. `struct_keys = ...  # go through the schema to figure out the struct's keys` 2. `new_cols = [col("yourStruct").getField(kk) for kk in struct_keys]` 3. `df.select(*(new_cols + orig_cols))` – ayplam Nov 21 '18 at 18:31
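
The three steps in the last comment could be sketched roughly as follows. This is only an illustration, not code from the answer: the helper names `struct_field_names` and `expand_struct` are hypothetical, and prefixing each new column with the struct's name is an assumption made here to avoid clashes with existing top-level columns. The key lookup works on the plain dictionary returned by `df.schema.jsonValue()`.

```python
def struct_field_names(schema_json, struct_col):
    """Return the nested field names of `struct_col`.

    `schema_json` is the dict produced by `df.schema.jsonValue()`.
    (Hypothetical helper, not part of PySpark.)
    """
    for field in schema_json["fields"]:
        if field["name"] == struct_col:
            dtype = field["type"]
            # Simple types are plain strings; complex types are dicts.
            if isinstance(dtype, dict) and dtype.get("type") == "struct":
                return [nested["name"] for nested in dtype["fields"]]
            raise ValueError(f"{struct_col!r} is not a struct column")
    raise KeyError(struct_col)


def expand_struct(df, struct_col):
    """Replace `struct_col` with one top-level column per nested field.

    (Hypothetical helper; `df` is assumed to be a PySpark DataFrame.)
    """
    # Step 1: go through the schema to find the struct's keys.
    struct_keys = struct_field_names(df.schema.jsonValue(), struct_col)
    # Step 2: build one column per key, prefixed to avoid name clashes.
    new_cols = [
        df[struct_col].getField(key).alias(f"{struct_col}_{key}")
        for key in struct_keys
    ]
    # Step 3: keep every other original column and append the new ones.
    orig_cols = [df[name] for name in df.columns if name != struct_col]
    return df.select(*(orig_cols + new_cols))
```

With the schema from the question, `expand_struct(df, "topicDistribution")` would be expected to yield the columns `index`, `text`, `wiki_index`, `topicDistribution_type` and `topicDistribution_values`; checking the result with `printSchema()` is a good idea.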