I have a schema of this form, read from a JSON file:
root
|-- fruit_id: string (nullable = true)
|-- fruit_type: array (nullable = true)
| |-- name: string (nullable = true)
| |-- info: struct (nullable = true)
| |-- fruit_quality: array (nullable = true)
| | |-- quality: string (nullable = true)
| |-- likes: string (containsNull = true)
| |-- finance: struct (nullable = true)
| | |-- last_year_price: string (nullable = true)
| | |-- current_price: string (nullable = true)
| |-- shops: struct (nullable = true)
| | |-- shop1: string (nullable = true)
| | |-- shop2: string (nullable = true)
|-- season: string (nullable = true)
How can I flatten it into this form?
root
|-- fruit_id: string (nullable = true)
|-- fruit_type_name: string (nullable = true)
|-- fruit_type_info_fruit_quality_quality: string (nullable = true)
|-- fruit_type_info_likes: string (nullable = true)
|-- fruit_type_finance_last_year_price: string (nullable = true)
|-- fruit_type_finance_current_price: string (nullable = true)
|-- fruit_type_shops_shop1: string (nullable = true)
|-- fruit_type_shops_shop2: string (nullable = true)
|-- season: string (nullable = true)
This is for the case of fruits. How would I flatten the data in the same way if I receive a file with information on vegetables?
I am having trouble flattening the array part. I am able to flatten structs inside structs; for that I followed this: link
I also added the following piece of code to the code from the link above, to see if this approach would work:
import pyspark.sql.functions as F

array_cols = [c[0] for c in df.dtypes if c[1][:6] == 'array']
df = df.select(
    [F.col(nc+'.'+c).alias(nc+'_'+c)
     for nc in array_cols
     for c in df.select(nc+'.*').columns])
But it's not working.
I then checked this link as well: link
The issue with that approach is that it hard-codes the column names: it can flatten the JSON file of fruits, but if I send a JSON file of vegetables with a similar schema, I'll have to rewrite the code.
Another approach I tried was converting the array to a struct so that I could then flatten the nested structs, but that didn't help.
Lastly, I checked this link as well: link
But this approach threw an error saying that flattening is not possible, since I have an array of structs and not an array of arrays.
So how can I solve this?
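To make the goal concrete, here is a rough sketch of the schema-driven behaviour I am after, written in plain Python so it runs without Spark. It assumes the schema is available in the nested dict form that `df.schema.jsonValue()` returns in PySpark, and that the array elements are structs. The helper names (`walk_schema`, `field`, `struct`, `array`) and the trimmed sample schema are made up for illustration. It only computes the flattened column names and the array paths; it does not flatten a DataFrame.

```python
def walk_schema(dtype, path=()):
    """Return (leaves, arrays) for a schema node in Spark's JSON/dict form.

    leaves: flattened names such as 'fruit_type_finance_current_price'
    arrays: dotted paths of array fields that would need an explode step
    """
    leaves, arrays = [], []
    if isinstance(dtype, dict) and dtype.get("type") == "struct":
        # Recurse into every field, extending the path with the field name.
        for f in dtype["fields"]:
            sub_leaves, sub_arrays = walk_schema(f["type"], path + (f["name"],))
            leaves += sub_leaves
            arrays += sub_arrays
    elif isinstance(dtype, dict) and dtype.get("type") == "array":
        # Record the array, then look through it at its element type.
        arrays.append(".".join(path))
        sub_leaves, sub_arrays = walk_schema(dtype["elementType"], path)
        leaves += sub_leaves
        arrays += sub_arrays
    else:
        # Primitive type (e.g. "string"): this is a leaf column.
        leaves.append("_".join(path))
    return leaves, arrays


# Hypothetical helpers that build the dict form by hand for this example.
def field(name, dtype):
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}

def struct(*fields):
    return {"type": "struct", "fields": list(fields)}

def array(element):
    return {"type": "array", "elementType": element, "containsNull": True}


# Trimmed-down version of the fruit schema (assumed element layout).
schema = struct(
    field("fruit_id", "string"),
    field("fruit_type", array(struct(
        field("name", "string"),
        field("finance", struct(
            field("last_year_price", "string"),
            field("current_price", "string"),
        )),
    ))),
    field("season", "string"),
)

leaves, arrays = walk_schema(schema)
print(leaves)
print(arrays)
```

Because nothing here mentions fruit-specific names, the same walk would apply to a vegetables file with a similar schema. In Spark, I imagine each path in `arrays` would be exploded first (for example with `F.explode_outer`) and then each leaf selected and aliased, but that part is what I have not managed to get working.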