3

I need to replace some value in a data-frame (with nested schema) with null, I have seen this solution but it seems it only works for one level nested schema.

My schema is something like this

root 
 ......
 ......
 ......
 |-- user: struct (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- ip_address: string (nullable = true)
 |    |-- state: struct (nullable = true) 
 |    |    |-- level: long (nullable = true)
 |    |    |-- session_id: string (nullable = true) 
 |    |    |-- xp: long (nullable = true)

What I wanted to do is replace user.state.level and user.state.xp with null and keep the rest of my data frame untouched.

Is there any way that I can achieve this?

If I follow this solution

val myUDF = udf((s:String) => {
    null
})

val structCols: Array[org.apache.spark.sql.Column] = badVersion.select($"user.*")
    .columns
    .map(name => col("user."+name))

val newDF = badVersion.withColumn(
    "user",
    struct((structCols:+myUDF($"user.country").as("country")):_*)
)

It works for the country and replaces the value but if I do this for

val newDF = badVersion.withColumn(
    "user",
    struct((structCols:+myUDF($"user.country").as("country"):+myUDF($"user.state.level").as("state.level")):_*)
)

It's just gonna add state.level as a new field

enter image description here

Am1rr3zA
  • 7,115
  • 18
  • 83
  • 125

1 Answers1

0

Based on @Auprba link in the comment I have used this link and came up with this solution.

val replaced = df.selectExpr("""
    named_struct(
         .....................................................
         ....... Other columns ...............................
         ....... In a form of  ...............................
         ....... '{columnname}', {columnname}, ...............
         .....................................................
        'user', named_struct(
          'country', user.country,
          'id', user.id,
          'ip_address', user.ip_address,
          'state', named_struct('hard_currency', null, 'level', null, 'session_id', user.state.session_id, 'soft_currency', null, 'xp', null)
        )
    ) as named_struct
""").select("named_struct.*")
display(replaced)
Am1rr3zA
  • 7,115
  • 18
  • 83
  • 125