
I am getting a very strange error in PySpark on Azure Synapse.

I am reading a JSON file with the query below, but I get a duplicate-column error even though there is no duplicate column. I can read the file with other tools, a JSON validator, and Synapse data flow, but not with PySpark.

My PySpark query is as below:

df = (
    spark.read.option("multiline", "true")
    .options(encoding="UTF-8")
    .load(
        "abfss://<Container>]@<DIR>.dfs.core.windows.net/export28.json", format="json"
    )
)

This is the stack trace I get:

AnalysisException: Found duplicate column(s) in the data schema: amendationcommentkey, amendationreasonkey, amendationregulatoryproofkey

Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 204, in load
    return self._df(self._jreader.load(path))
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Found duplicate column(s) in the data schema: amendationcommentkey, amendationreasonkey, amendationregulatoryproofkey
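For reference, duplicate keys are legal JSON (the last occurrence wins), so a JSON validator will not flag them. A small stdlib-only check like the one below (a sketch, not part of the original pipeline) can reveal keys that collide, including keys that collide only once letter case is folded, which is what Spark checks by default:

```python
import json

def find_duplicate_keys(text, case_sensitive=False):
    """Collect JSON object keys that collide, optionally ignoring case.

    Validators accept duplicate keys (the last one wins), which is why
    other tools can read a file that Spark rejects.
    """
    duplicates = set()

    def check_pairs(pairs):
        seen = set()
        for key, _value in pairs:
            norm = key if case_sensitive else key.lower()
            if norm in seen:
                duplicates.add(norm)
            seen.add(norm)
        return dict(pairs)  # applied to every object, nested ones included

    json.loads(text, object_pairs_hook=check_pairs)
    return sorted(duplicates)

# Hypothetical sample: two keys differ only in case, so they collide
# under Spark's default case-insensitive comparison.
sample = '{"AmendationCommentKey": 1, "amendationcommentkey": 2, "other": 3}'
print(find_duplicate_keys(sample))                       # -> ['amendationcommentkey']
print(find_duplicate_keys(sample, case_sensitive=True))  # -> []
```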

vladsiv

1 Answer


This error indicates that Spark found duplicate column names, either among the top-level columns or inside a nested structure.

Below is the relevant statement from the Apache Spark documentation:

In Spark 3.1, the Parquet, ORC, Avro and JSON datasources throw the exception org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema in read if they detect duplicate names in top-level columns as well in nested structures. The datasources take into account the SQL config spark.sql.caseSensitive while detecting column name duplicates.
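As the quoted note says, duplicate detection honours spark.sql.caseSensitive. If the reported columns collide only because of letter case, one workaround (a sketch; verify the resulting column names, since case-sensitive mode affects all subsequent column resolution) is to make Spark case-sensitive before reading:

```python
# Sketch: with case-sensitive resolution, e.g. "AmendationCommentKey" and
# "amendationcommentkey" are treated as distinct columns instead of duplicates.
spark.conf.set("spark.sql.caseSensitive", "true")

df = (
    spark.read.option("multiline", "true")
    .options(encoding="UTF-8")
    .json("abfss://<Container>@<DIR>.dfs.core.windows.net/export28.json")
)
```

This is a session-level config change, so consider resetting it to "false" after the read if the rest of your job expects the default behaviour.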

Try reading the file with an explicit schema as below, since everything depends on the schema; this code worked in my case.

# Infer the schema once from a reference file, then reuse it for the read
Sch = spark.read.json(schemaPath)
schema = Sch.schema

df = spark.read.option("multiline", "true").schema(schema).json(json_path)

Also refer to these Stack Overflow answers (SO1, SO2, SO3), as the authors give great explanations for different scenarios.

SaiKarri-MT