
I have a DataFrame read from a JSON file with the following schema:

root
 |-- fruit_id: string (nullable = true)
 |-- fruit_type: array (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- info: struct (nullable = true)
 |    |    |-- fruit_quality: array (nullable = true)
 |    |    |    |-- quality: string (nullable = true)
 |    |    |-- likes: string (containsNull = true)
 |    |-- finance: struct (nullable = true)
 |    |    |-- last_year_price: string (nullable = true)
 |    |    |-- current_price: string (nullable = true)
 |    |-- shops: struct (nullable = true)
 |    |    |-- shop1: string (nullable = true)
 |    |    |-- shop2: string (nullable = true)
 |-- season: string (nullable = true)

How can I flatten it into this form?

root
 |-- fruit_id: string (nullable = true)
 |-- fruit_type_name: string (nullable = true)
 |-- fruit_type_info_fruit_quality_quality: string (nullable = true)
 |-- fruit_type_info_likes: string (nullable = true)
 |-- fruit_type_finance_last_year_price: string (nullable = true)
 |-- fruit_type_finance_current_price: string (nullable = true)
 |-- fruit_type_shops_shop1: string (nullable = true)
 |-- fruit_type_shops_shop2: string (nullable = true)
 |-- season: string (nullable = true)

This schema is for fruits. How can I flatten a file in the same way, without changing the code, if I receive one with a similar schema for vegetables?

I am having trouble flattening the array part. I can flatten structs nested inside structs by following this: link

I also added this piece of code to the code from the link above, to see if this approach would work:

import pyspark.sql.functions as F

array_cols = [c[0] for c in df.dtypes if c[1][:6] == 'array']
df = df.select(
    [F.col(nc + '.' + c).alias(nc + '_' + c)
     for nc in array_cols
     for c in df.select(nc + '.*').columns])

But it's not working.

I then checked this link as well: link

The issue with that approach is that it works for the fruits JSON file, but if I send a JSON file of vegetables with a similar schema, I have to rewrite the code.

Another approach I tried was converting the array to a struct so that I could then flatten the nested structs, but that did not help.

Lastly, I checked this link as well: link

But this approach threw an error saying that flattening is not possible, since I have an array of structs and not an array of arrays.

So how can I solve this?
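
For reference, here is a minimal sketch of one schema-driven approach that might work, not a tested solution for this exact file. The idea is to look at the DataFrame's schema at runtime, explode the first array column found, expand the first struct column found into prefixed columns, and repeat until nothing nested is left. The names `next_step`, `flatten` and `sep` are hypothetical helpers made up for illustration. It assumes Spark 2.3 or later (for `explode_outer`), and note that exploding an array produces one row per element, so the row count can grow.

```python
def next_step(fields):
    """Pick the next flattening action from top-level schema fields.

    `fields` is the list found under df.schema.jsonValue()["fields"].
    Returns ("explode", name, []) for the first array column,
    ("expand", name, child_names) for the first struct column,
    or None when no top-level column is an array or a struct.
    """
    for f in fields:
        t = f["type"]
        # Primitive types are plain strings; complex types are dicts.
        if isinstance(t, dict) and t["type"] == "array":
            return ("explode", f["name"], [])
        if isinstance(t, dict) and t["type"] == "struct":
            return ("expand", f["name"], [c["name"] for c in t["fields"]])
    return None


def flatten(df, sep="_"):
    """Repeatedly explode arrays and expand structs until df is flat."""
    # Imported lazily so next_step can be used without Spark installed.
    from pyspark.sql import functions as F

    while True:
        step = next_step(df.schema.jsonValue()["fields"])
        if step is None:
            return df
        kind, name, children = step
        if kind == "explode":
            # One output row per array element; explode_outer keeps rows
            # whose array is null or empty (the column becomes null).
            df = df.withColumn(name, F.explode_outer(F.col("`" + name + "`")))
        else:
            # Replace the struct column, in place, by one column per child,
            # named <parent><sep><child>.
            cols = []
            for c in df.columns:
                if c == name:
                    cols.extend(
                        F.col("`" + name + "`.`" + ch + "`").alias(name + sep + ch)
                        for ch in children
                    )
                else:
                    cols.append(F.col("`" + c + "`"))
            df = df.select(cols)
```

Usage would be something like `flat_df = flatten(df)`. Since nothing in the sketch refers to fruit-specific column names, the same call should in principle apply to a vegetables file with a similar shape.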
