I have a DataFrame df with the following schema:

root
 |-- car: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)

Then I do new_df = df.drop("person.name"). I also tried df.drop(col("person.name")). The schema of new_df is:

root
 |-- car: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)

The schema of new_df has not changed. Any idea why? Assuming I want a final result with (person.age, car), how can I do it?
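
For reference, a minimal way to reproduce this (a sketch: it assumes a SparkSession in scope as spark, and the names and values are just examples):

import spark.implicits._
import org.apache.spark.sql.functions.col

case class Person(age: Long, name: String)
val df = Seq(("tesla", Person(30L, "alice"))).toDF("car", "person")

// both calls are no-ops: drop only removes top-level columns, never nested struct fields
df.drop("person.name").printSchema()
df.drop(col("person.name")).printSchema()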


1 Answer


You will have to flatten the person struct into separate top-level columns and then use drop:

new_df.select("car", "person.*").drop("name")
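
That leaves only top-level columns; the schema at that point is roughly:

root
 |-- car: string (nullable = true)
 |-- age: long (nullable = true)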

If you want person.age nested inside a struct again, you can reconstruct it:

import org.apache.spark.sql.functions._
new_df
  .select("car", "person.*")              // flatten the struct into top-level columns
  .drop("name")                           // drop the unwanted field
  .withColumn("person", struct("age"))    // rebuild the struct with only age
  .drop("age")                            // remove the now-duplicated top-level age

root
 |-- car: string (nullable = true)
 |-- person: struct (nullable = false)
 |    |-- age: long (nullable = true)

As @RaphaelRoth has pointed out in the comments, you can just use

new_df.select($"car",struct($"person.age").as("person"))

Or, even shorter:

new_df.withColumn("person", struct("person.age"))

UDF way

You can even do it with a UDF (not recommended, since UDFs are opaque to the optimizer; shown just for information):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// the struct column arrives in the udf as a Row
def removeStruct = udf((p: Row) => person(p.getAs[Long]("age")))

new_df.withColumn("person", removeStruct(col("person")))

for that you need a case class for the result (the input struct comes into the udf as a Row, not as a case class)

case class person(age: Long)