 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: struct (nullable = true)
 |    |    |-- z: struct (nullable = true)
 |    |    |    |-- aa: string (nullable = true)

I have the above nested schema where I want to change column z from struct to string.

 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: struct (nullable = true)
 |    |    |-- z: string (nullable = true)

I'm on Spark 2.4.x, not Spark 3. I'd prefer a Scala solution, but Python works too, since this is a one-time manual job to backfill some past data.

Is there a way to do this with a UDF, or some other approach?

I know this would be easy with to_json on a top-level column, but the nested array of structs is causing issues.
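To make the goal concrete, here is a plain-Python sketch of the transformation I'm after, using made-up sample data outside Spark: each element of the array keeps y as a struct, but z is serialized to a JSON string.

```python
import json

# Hypothetical sample row matching the schema: x is an array of structs,
# each with a struct y and a struct z containing the string field aa.
row = {"x": [{"y": {"b": 1}, "z": {"aa": "hello"}}]}

# Desired result: z becomes its JSON string representation.
converted = {
    "x": [
        {"y": elem["y"], "z": json.dumps(elem["z"])}
        for elem in row["x"]
    ]
}
print(converted["x"][0]["z"])  # '{"aa": "hello"}'
```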

Chirag Sejpal

2 Answers


You can cast inside a higher-order function:

from pyspark.sql.functions import expr

df3 = df.withColumn('x', expr('transform(x, s -> struct(s.y as y, cast(to_json(s.z) as string) as z))'))
df3.printSchema()

root
 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- y: struct (nullable = true)
 |    |    |-- z: string (nullable = true)
wwnde
  • Doesn't work - I get `org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct(NamePlaceholder(), namedlambdavariable() due to data type mismatch: Only foldable string expressions are allowed to appear at odd position, got: NamePlaceholder; line 1 pos 25` – Chirag Sejpal Apr 28 '22 at 16:53
  • 1
    Instead of `s.y` you should use `s.y as y` – ZygD Apr 28 '22 at 18:10
  • @ZygD That seems to have worked but it removes the key aa from the stringified version of the struct. Any way to have the key in the string as well? – Chirag Sejpal Apr 28 '22 at 18:49
  • Hah doing cast(to_json(s.z) as string) seems to keep the key as well. – Chirag Sejpal Apr 28 '22 at 18:51
  • wwnde thanks for the great solution - I edited it a bit. Also, similar question here https://stackoverflow.com/questions/72037029/update-a-highly-nested-column-from-string-to-struct - I'm trying to use your logic for that as well but I'm not sure how to pass in the schema for from_json to work inside expr/transform. Any suggestions? Can you take a look at that question too? Also thanks @ZygD – Chirag Sejpal Apr 28 '22 at 20:05
  • https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/from_json - used this as a reference it works for the other question too! – Chirag Sejpal Apr 29 '22 at 00:40
  • @Chirag Sejpal, sorry was at work. answered, could you please let me know if that helps. Happy to assist further – wwnde Apr 29 '22 at 12:05
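To illustrate the point from the comment exchange above in plain Python (hypothetical data, with json.dumps standing in for Spark's to_json): serializing the whole z struct keeps the key aa in the resulting string, while extracting the field alone keeps only the value.

```python
import json

z = {"aa": "some value"}  # hypothetical z struct

# to_json-style serialization keeps the key:
as_json = json.dumps(z)

# extracting the field alone drops the key:
just_value = z["aa"]

print(as_json)     # '{"aa": "some value"}'
print(just_value)  # 'some value'
```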

For your specific case, you can do it with built-in functions on Spark 2.4 or Spark 3.0.

Spark 2.4

You can use arrays_zip as follows:

  • first, you create arrays for each field you want to have as struct element of your array
  • second, you use arrays_zip to zip those fields

Here is the complete code, with df your input dataframe:

import org.apache.spark.sql.functions.{arrays_zip, col}

df.withColumn("x",
      arrays_zip(
        col("x").getField("y").alias("y"),
        col("x").getField("z").getField("aa").alias("z")
      ))
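For intuition, arrays_zip behaves much like Python's zip over the per-field arrays. Here is a rough plain-Python sketch (hypothetical data, not the Spark API) of what the Spark 2.4 snippet above does:

```python
# Hypothetical input: x as a Python list of dicts mirroring the array of structs.
x = [
    {"y": {"b": 1}, "z": {"aa": "foo"}},
    {"y": {"b": 2}, "z": {"aa": "bar"}},
]

# Step 1: build one array per target field (like col("x").getField(...)).
ys = [elem["y"] for elem in x]
zs = [elem["z"]["aa"] for elem in x]

# Step 2: zip them back into an array of structs (like arrays_zip).
zipped = [{"y": y, "z": z} for y, z in zip(ys, zs)]
print(zipped[1])  # {'y': {'b': 2}, 'z': 'bar'}
```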

Spark 3.0

You can use transform to rebuild the element struct of your array, as follows:

import org.apache.spark.sql.functions.{transform, struct, col}

df.withColumn("x", transform(
      col("x"),
      element => struct(
        element.getField("y").alias("y"),
        element.getField("z").getField("aa").alias("z")
      )
    ))
Vincent Doba