I am getting a very strange error in PySpark in Azure Synapse (I also tried a Synapse data flow).
I am reading a JSON file with the query below but get a duplicate column error, even though there is no duplicate column. I can read the file with other tools and a JSON validator, and also with the data flow, but not with PySpark.
The PySpark query is as below:
df = (
    spark.read.option("multiline", "true")
    .options(encoding="UTF-8")
    .load(
        "abfss://<Container>@<DIR>.dfs.core.windows.net/export28.json",
        format="json",
    )
)
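In case it helps, this is a minimal sketch (container and storage account are placeholders, and it assumes the file fits in driver memory) that dumps the top-level field names exactly as they appear in the file, e.g. to spot names that differ only by case or whitespace:

import json

# Read the whole JSON file as one string (wholetext keeps it as a single row).
raw = (
    spark.read.option("wholetext", "true")
    .text("abfss://<Container>@<DIR>.dfs.core.windows.net/export28.json")
    .first()[0]
)
doc = json.loads(raw)
# If the document is a JSON array, inspect the first record's keys.
record = doc[0] if isinstance(doc, list) else doc
for name in sorted(record.keys(), key=str.lower):
    print(name)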
This is the stack trace I get:

AnalysisException: Found duplicate column(s) in the data schema: amendationcommentkey, amendationreasonkey, amendationregulatoryproofkey
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 204, in load
    return self._df(self._jreader.load(path))
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Found duplicate column(s) in the data schema: amendationcommentkey, amendationreasonkey, amendationregulatoryproofkey