So I have a dataframe that I gather from a table I have in my database. When I read it from the database, the json column becomes a string in my dataframe, no problem I convert it using:
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
Now I get the following schema for my dataframe, I have removed a lot of irrelevant properties from the json, but trust me there is a lot:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- Id: string (nullable = true)
| | |-- ... many more properties
Something is wrong here, inside of the nested object, object, I have two properties with the same name. This is because some of the older entries in the database had the Id of object as a string type, but that was later changed to a long type. Because of this I cannot write my dataframe using:
df.write.format("delta").option('mergeSchema', 'true').option("inferSchema", "true").option("overwriteSchema", "true").mode("overwrite").save(output_folder)
I get the following error:
AnalysisException: Found duplicate column(s) in the metadata update: json.object.id;
How do I go about fixing this. Ideally I would like to merge these two columns names, by casting Id with a string type to a long type and merge them. So I could get a schema like:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- ... many more properties
Note how the root of the json also has a property called Id, I would like for that to be untouched!
Can someone help me with this?