My schema looks like this:

root
 |-- source: string (nullable = true)
 |-- results: array (nullable = true)
 |    |-- content: struct (containsNull = true)
 |    |    |-- ptype: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- verb: string (nullable = true)
 |    |    |-- foobar: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- fooId: integer (nullable = true)
 |-- date: string (nullable = false)
 |-- hour: string (nullable = false)

I have a DataFrame (df) with the above schema. I want to create a DataFrame without fooId. I cannot use drop because it's a nested column.

The tricky part is that results is an array with content as a struct, and fooId sits inside that struct.

What would be the cleanest way to accomplish this?

suprita shankar
  • @user6910411 The concept is the same but the structure is different. Here it is a struct with an array. – suprita shankar Nov 27 '18 at 00:44
  • Since it's an array, I believe the easiest way would be to use a `UDF` to remove the data. Or you could `explode` the array first, but that would change the dataframe structure. – Shaido Nov 27 '18 at 01:46
  • I would prefer the former approach. So I have to map and then pass it to the udf? Can you elaborate a little more? – suprita shankar Nov 27 '18 at 02:42
  • You can create a udf that takes the `results` array as input, then do all necessary processing inside it (i.e. remove `fooId`). You can use `withColumn` to overwrite the `results` column with what the udf returns. – Shaido Nov 27 '18 at 02:54
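
The UDF approach described in the comments could look roughly like the sketch below, written for PySpark. It is not a tested answer. It assumes each element of `results` is the struct that directly holds `ptype`, `domain`, `verb`, `foobar` and `fooId`. The names `drop_field` and `without_foo_id` are made up for illustration. The return schema has to be spelled out by hand, without `fooId`.

```python
from typing import Any, Dict, List, Optional


def drop_field(results: Optional[List[Any]],
               field: str = "fooId") -> Optional[List[Optional[Dict[str, Any]]]]:
    """Return a copy of `results` with `field` removed from every element."""
    if results is None:
        return None
    cleaned = []
    for item in results:
        if item is None:
            # Keep null array elements as they are.
            cleaned.append(None)
            continue
        # Inside a Python UDF, struct values arrive as Row objects;
        # Row.asDict() turns them into a plain dict. Plain dicts are
        # accepted too, so the helper can be tried without Spark.
        record = item.asDict() if hasattr(item, "asDict") else dict(item)
        record.pop(field, None)
        cleaned.append(record)
    return cleaned


def without_foo_id(df):
    """Overwrite `results` with the same array minus `fooId` (hypothetical helper)."""
    # Imported here so that drop_field above can be used without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import (ArrayType, MapType, StringType,
                                   StructField, StructType)

    # Assumed element schema: the fields from the question, minus fooId.
    element = StructType([
        StructField("ptype", StringType(), True),
        StructField("domain", StringType(), True),
        StructField("verb", StringType(), True),
        StructField("foobar", MapType(StringType(), StringType(), True), True),
    ])
    strip_udf = F.udf(drop_field, ArrayType(element, True))
    return df.withColumn("results", strip_udf(F.col("results")))
```

Calling `without_foo_id(df)` should give a DataFrame with the same top-level columns, so the structure is preserved (unlike with `explode`). If the real nesting has one more level than assumed here, `drop_field` and the `element` schema would need to be adjusted to match.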

0 Answers