I need to replace some values in a DataFrame (with a nested schema) with null. I have seen this solution, but it seems to only work for a one-level nested schema.
My schema is something like this:
root
......
......
......
|-- user: struct (nullable = true)
| |-- country: string (nullable = true)
| |-- id: string (nullable = true)
| |-- ip_address: string (nullable = true)
| |-- state: struct (nullable = true)
| | |-- level: long (nullable = true)
| | |-- session_id: string (nullable = true)
| | |-- xp: long (nullable = true)
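For reference, here is a minimal frame with the same shape that I test against (the values are made up, only the nesting matters):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// One made-up row with the same user/state nesting as the schema above.
val badVersion = Seq(("US", "u1", "1.2.3.4", 42L, "sess-1", 1000L))
  .toDF("country", "id", "ip_address", "level", "session_id", "xp")
  .select(struct(
    $"country", $"id", $"ip_address",
    struct($"level", $"session_id", $"xp").as("state")
  ).as("user"))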
What I want to do is replace user.state.level and user.state.xp with null while keeping the rest of my DataFrame untouched.
Is there any way I can achieve this?
If I follow this solution:

import org.apache.spark.sql.functions.{col, struct, udf}

// UDF that ignores its input and always returns a null string
val myUDF = udf((s: String) => null: String)

// All the top-level fields of the user struct
val structCols: Array[org.apache.spark.sql.Column] = badVersion
  .select($"user.*")
  .columns
  .map(name => col("user." + name))

val newDF = badVersion.withColumn(
  "user",
  struct((structCols :+ myUDF($"user.country").as("country")): _*)
)
It works for country and replaces the value, but if I try the same for user.state.level:
val newDF = badVersion.withColumn(
  "user",
  struct((structCols :+
    myUDF($"user.country").as("country") :+
    myUDF($"user.state.level").as("state.level")): _*)
)
it just adds state.level as a new top-level field inside user instead of replacing the nested user.state.level.
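Is the only option to rebuild the inner struct by hand? The untested sketch below is roughly what I imagine, using lit(null) with an explicit cast instead of the UDF so the types stay right, but I am not sure this is the correct approach (guessDF is just a placeholder name):

import org.apache.spark.sql.functions.{lit, struct}
import org.apache.spark.sql.types.LongType

// Untested sketch: rebuild user.state with level and xp nulled out,
// then rebuild user around the new state struct.
val guessDF = badVersion.withColumn(
  "user",
  struct(
    $"user.country".as("country"),
    $"user.id".as("id"),
    $"user.ip_address".as("ip_address"),
    struct(
      lit(null).cast(LongType).as("level"),
      $"user.state.session_id".as("session_id"),
      lit(null).cast(LongType).as("xp")
    ).as("state")
  )
)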