 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: struct (nullable = true)
 |    |    |-- z: struct (nullable = true)
 |    |    |    |-- aa: string (nullable = true)

I have the above nested schema where I want to change column z from struct to string.

 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: struct (nullable = true)
 |    |    |-- z: string (nullable = true)

I'm on Spark 2.4.x, not Spark 3. I'd prefer a Scala solution, but Python works too, since this is a one-time manual job to backfill some past data.

Is there a way to do this with a UDF, or some other approach?

I know this would be easy with to_json on a top-level column, but the nested array of structs is causing issues.
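To make the goal concrete, here is a plain-Python sketch of the transformation I'm after, using made-up sample data outside Spark: each element of the array keeps y as a struct, but z is serialized to a JSON string.

```python
import json

# Hypothetical sample row matching the schema: x is an array of structs,
# each with a struct y and a struct z containing the string field aa.
row = {"x": [{"y": {"b": 1}, "z": {"aa": "hello"}}]}

# Desired result: z becomes its JSON string representation.
converted = {
    "x": [
        {"y": elem["y"], "z": json.dumps(elem["z"])}
        for elem in row["x"]
    ]
}
print(converted["x"][0]["z"])  # '{"aa": "hello"}'
```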

Chirag Sejpal

2 Answers


You can cast inside a higher-order function:

from pyspark.sql.functions import expr

df3 = df.withColumn('x', expr('transform(x, s -> struct(s.y as y, cast(to_json(s.z) as string) as z))'))
df3.printSchema()

root
 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- y: struct (nullable = true)
 |    |    |-- z: string (nullable = true)
wwnde
  • Doesn't work - I get `org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct(NamePlaceholder(), namedlambdavariable() due to data type mismatch: Only foldable string expressions are allowed to appear at odd position, got: NamePlaceholder; line 1 pos 25` – Chirag Sejpal Apr 28 '22 at 16:53
  • 1
    Instead of `s.y` you should use `s.y as y` – ZygD Apr 28 '22 at 18:10
  • @ZygD That seems to have worked but it removes the key aa from the stringified version of the struct. Any way to have the key in the string as well? – Chirag Sejpal Apr 28 '22 at 18:49
  • Hah doing cast(to_json(s.z) as string) seems to keep the key as well. – Chirag Sejpal Apr 28 '22 at 18:51
  • wwnde thanks for the great solution - I edited it a bit. Also, similar question here https://stackoverflow.com/questions/72037029/update-a-highly-nested-column-from-string-to-struct - I'm trying to use your logic for that as well but I'm not sure how to pass in the schema for from_json to work inside expr/transform. Any suggestions? Can you take a look at that question too? Also thanks @ZygD – Chirag Sejpal Apr 28 '22 at 20:05
  • https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/from_json - used this as a reference it works for the other question too! – Chirag Sejpal Apr 29 '22 at 00:40
  • @Chirag Sejpal, sorry was at work. answered, could you please let me know if that helps. Happy to assist further – wwnde Apr 29 '22 at 12:05
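To illustrate the point from the comment exchange above in plain Python (hypothetical data, with json.dumps standing in for Spark's to_json): serializing the whole z struct keeps the key aa in the resulting string, while extracting the field alone keeps only the value.

```python
import json

z = {"aa": "some value"}  # hypothetical z struct

# to_json-style serialization keeps the key:
as_json = json.dumps(z)

# extracting the field alone drops the key:
just_value = z["aa"]

print(as_json)     # '{"aa": "some value"}'
print(just_value)  # 'some value'
```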

For your specific case, you can do it with built-in functions on Spark 2.4 or Spark 3.0.

Spark 2.4

You can use arrays_zip as follows:

  • first, you create arrays for each field you want to have as struct element of your array
  • second, you use arrays_zip to zip those fields

Here is the complete code, with df your input dataframe:

import org.apache.spark.sql.functions.{arrays_zip, col}

df.withColumn("x",
      arrays_zip(
        col("x").getField("y").alias("y"),
        col("x").getField("z").getField("aa").alias("z")
      ))
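For intuition, arrays_zip behaves much like Python's zip over the per-field arrays. Here is a rough plain-Python sketch (hypothetical data, not the Spark API) of what the Spark 2.4 snippet above does:

```python
# Hypothetical input: x as a Python list of dicts mirroring the array of structs.
x = [
    {"y": {"b": 1}, "z": {"aa": "foo"}},
    {"y": {"b": 2}, "z": {"aa": "bar"}},
]

# Step 1: build one array per target field (like col("x").getField(...)).
ys = [elem["y"] for elem in x]
zs = [elem["z"]["aa"] for elem in x]

# Step 2: zip them back into an array of structs (like arrays_zip).
zipped = [{"y": y, "z": z} for y, z in zip(ys, zs)]
print(zipped[1])  # {'y': {'b': 2}, 'z': 'bar'}
```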

Spark 3.0

You can use transform to rebuild the element struct of your array, as follows:

import org.apache.spark.sql.functions.{transform, struct, col}

df.withColumn("x", transform(
      col("x"),
      element => struct(
        element.getField("y").alias("y"),
        element.getField("z").getField("aa").alias("z")
      )
    ))
Vincent Doba