
I have a DataFrame which contains multiple nested columns. The schema is not static and could change upstream of my Spark application. Schema evolution is guaranteed to always be backward compatible. An anonymized, shortened version of the schema is pasted below

root
 |-- isXPresent: boolean (nullable = true)
 |-- isYPresent: boolean (nullable = true)
 |-- isZPresent: boolean (nullable = true)
 |-- createTime: long (nullable = true)
<snip>
 |-- structX: struct (nullable = true)
 |    |-- hostIPAddress: integer (nullable = true)
 |    |-- uriArguments: string (nullable = true)
<snip>
 |-- structY: struct (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- cookies: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: array (valueContainsNull = true)
 |    |    |    |-- element: string (containsNull = true)
<snip>

The Spark job is supposed to transform "structX.uriArguments" from string to map(string, string). A somewhat similar situation is asked in this post; however, that answer assumes the schema is static and does not change, so a case class does not work in my situation.

What would be the best way to transform structX.uriArguments without hard-coding the entire schema inside the code? The outcome should look like this:

root
 |-- isXPresent: boolean (nullable = true)
 |-- isYPresent: boolean (nullable = true)
 |-- isZPresent: boolean (nullable = true)
 |-- createTime: long (nullable = true)
<snip>
 |-- structX: struct (nullable = true)
 |    |-- hostIPAddress: integer (nullable = true)
 |    |-- uriArguments: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
<snip>
 |-- structY: struct (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- cookies: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: array (valueContainsNull = true)
 |    |    |    |-- element: string (containsNull = true)
<snip>

Thanks

nads
  • I'm not sure I understand the question. The URI arguments are represented in form of _key1=value1&key2=value2&..._. You could split this easily into a Map[String, String] – nads May 21 '18 at 16:55

1 Answer


You could try using DataFrame.withColumn(), which lets you reference nested fields: add a new map column and drop the flat one. This question shows how to handle structs with withColumn.
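A minimal sketch of that idea, assuming the `key1=value1&key2=value2` format mentioned in the comments; `parseUriArgs` is a name chosen here for illustration. Only the one field name `uriArguments` is hard-coded — the other fields of `structX` are read from the runtime schema, so the struct survives backward-compatible schema evolution:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

// Parse "k1=v1&k2=v2" into Map[String, String]; tolerate nulls and
// malformed pairs (pairs without '=' are dropped).
val parseUriArgs = udf { (s: String) =>
  if (s == null) null
  else s.split("&")
        .filter(_.contains("="))
        .map { kv => val Array(k, v) = kv.split("=", 2); k -> v }
        .toMap
}

// Discover structX's fields at runtime instead of hard-coding the schema.
val structXFields = df.schema("structX").dataType
  .asInstanceOf[StructType]
  .fieldNames

// Rebuild structX field by field, swapping only uriArguments.
val newStructX = struct(
  structXFields.map {
    case "uriArguments" =>
      parseUriArgs(col("structX.uriArguments")).as("uriArguments")
    case other =>
      col(s"structX.$other").as(other)
  }: _*
)

val transformed = df.withColumn("structX", newStructX)
```

Because `withColumn("structX", ...)` replaces the existing column, the new map ends up inside the struct directly, with no separate drop-and-move steps.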

botchniaque
  • Thank you for the reply. The provided link allows me to create a new column and move it inside the structure. However, the goal is to *replace* structX.uriArguments. To do so, I think I need to: 1) create a new uriArguments at the top level; 2) drop structX.uriArguments, which is not straightforward, but this [post](https://stackoverflow.com/questions/32727279/dropping-a-nested-column-from-spark-dataframe) shows how; and 3) move the newly created uriArguments inside structX – nads May 21 '18 at 18:16
  • I stand corrected. The last part of the answer DOES replace an existing attribute in the structure :) – nads May 22 '18 at 04:42