
I have a dataframe that I gather from a table in my database. When I read it from the database, the JSON column becomes a string in my dataframe. No problem, I convert it using:

  from pyspark.sql.functions import from_json

  df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))  # infer a schema from the raw JSON strings
  df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
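A quick sanity check of what got inferred (just a sketch):

  df_json.printSchema()                   # schema inferred from the raw JSON strings
  df.select("json_data").printSchema()    # the parsed struct column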

Now I get the following schema for my dataframe (I have removed a lot of irrelevant properties from the JSON, but trust me, there are a lot):

 root
 |-- Id: integer (nullable = true)
 |-- json_data: struct (nullable = false)
 |    |-- object: struct (nullable = true)
 |    |    |-- Id: long (nullable = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- ... many more properties

Something is wrong here: inside the nested struct object, I have two properties with the same name. This is because some of the older entries in the database stored the Id of object as a string type, but that was later changed to a long type. Because of this, I cannot write my dataframe using:

df.write.format("delta").option('mergeSchema', 'true').option("inferSchema", "true").option("overwriteSchema", "true").mode("overwrite").save(output_folder)

I get the following error:

AnalysisException: Found duplicate column(s) in the metadata update: json.object.id;
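For what it's worth, the collision is easy to confirm programmatically (a sketch against the schema above; the field access assumes the exact names shown there):

  from collections import Counter

  # Walk down to json_data.object and count repeated field names
  object_schema = df.schema["json_data"].dataType["object"].dataType
  dupes = [name for name, n in Counter(object_schema.fieldNames()).items() if n > 1]
  print(dupes)  # ['Id']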

How do I go about fixing this? Ideally I would like to merge these two columns by casting the string-typed Id to a long and merging the values, so I could get a schema like:

 root
 |-- Id: integer (nullable = true)
 |-- json_data: struct (nullable = false)
 |    |-- object: struct (nullable = true)
 |    |    |-- Id: long (nullable = true)
 |    |    |-- ... many more properties

Note how the root of the JSON also has a property called Id; I would like for that to be left untouched!

Can someone help me with this?


1 Answer


Instead of modifying and removing the duplicate column after having used:

df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")

I went with a solution where I used regex substitution on the JsonCol beforehand:

df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))

This strips the quotes around the Id value (the digits are captured by the group and put back via $1), so those values are interpreted as longs instead of strings during schema inference.
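To see the substitution in action on a couple of made-up rows (a minimal sketch; the sample strings are my own):

from pyspark.sql.functions import col, regexp_replace

sample = spark.createDataFrame(
    [('{"object":{"Id":"42"}}',), ('{"object":{"Id":7}}',)],
    ["JsonCol"],
)
sample.withColumn(
    "JsonCol", regexp_replace(col("JsonCol"), r'"Id":"([0-9]+)"', '"Id":$1')
).show(truncate=False)
# {"object":{"Id":42}}   <- quotes stripped, digits kept via the capture group
# {"object":{"Id":7}}    <- already numeric, left untouched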

Putting it all together:

df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")

Because the raw JSON is rewritten before schema inference, the inferred schema only ever sees a single, long-typed Id. This gave me my expected output, and afterwards I was able to save it properly:

root
 |-- Id: integer (nullable = true)
 |-- json_data: struct (nullable = false)
 |    |-- object: struct (nullable = true)
 |    |    |-- Id: long (nullable = true)
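
For completeness, the write from the question then goes through as well (a sketch; I dropped inferSchema, since that is a read option and has no effect on a write):

df.write.format("delta") \
    .option("mergeSchema", "true") \
    .option("overwriteSchema", "true") \
    .mode("overwrite") \
    .save(output_folder)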