I have a DataFrame df with the following schema:

root
 |-- car: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)

Then I do new_df = df.drop("person.name"). I also tried df.drop(col("person.name")). The schema of new_df is:

root
 |-- car: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)

The schema of new_df has not changed. Any idea why? Assuming I want a final result with (person.age, car), how can I do it?
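
For reference, a minimal way to reproduce this (a sketch: it assumes a SparkSession in scope as spark, and the names and values are just examples):

import spark.implicits._
import org.apache.spark.sql.functions.col

case class Person(age: Long, name: String)
val df = Seq(("tesla", Person(30L, "alice"))).toDF("car", "person")

// both calls are no-ops: drop only removes top-level columns, never nested struct fields
df.drop("person.name").printSchema()
df.drop(col("person.name")).printSchema()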


1 Answer


You will have to flatten the person struct into separate top-level columns and then use drop:

new_df.select("car", "person.*").drop("name")
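
That leaves only top-level columns; the schema at that point is roughly:

root
 |-- car: string (nullable = true)
 |-- age: long (nullable = true)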

If you want person.age nested inside a struct again, you can reconstruct it:

import org.apache.spark.sql.functions._
new_df
  .select("car", "person.*")              // flatten the struct into top-level columns
  .drop("name")                           // drop the unwanted field
  .withColumn("person", struct("age"))    // rebuild the struct with only age
  .drop("age")                            // remove the now-duplicated top-level age

root
 |-- car: string (nullable = true)
 |-- person: struct (nullable = false)
 |    |-- age: long (nullable = true)

As @RaphaelRoth has pointed out in the comments, you can just use

new_df.select($"car",struct($"person.age").as("person"))

Or, even shorter:

new_df.withColumn("person", struct("person.age"))

UDF way

You can even do it with a UDF (not recommended, since UDFs are opaque to the optimizer; shown just for information):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// the struct column arrives in the udf as a Row
def removeStruct = udf((p: Row) => person(p.getAs[Long]("age")))

new_df.withColumn("person", removeStruct(col("person")))

for that you need a case class for the result (the input struct comes into the udf as a Row, not as a case class)

case class person(age: Long)