Spark: nested json data and duplicate column names (pyspark)

Question

I'm dealing with deeply nested json data. My goal is to flatten the data. I know I can do this by using the following notation in the case when the nested column I want is called attributes.id, where id is nested in the attributes column:

df = df.select('attributes.id')

The problem is that there is already a column in df called id and since spark only keeps the last part after . as the column name, I now have duplicated column names. What is the best way of dealing with this? Ideally the new column will be called attributes_id as to differentiate it from the id column.

score 0 · Answer 1 · answered Aug 17 '20 at 16:11

0

Rename the column using .withColumn

(or)

flatten the dataframe then use .toDF() to rename the columns in the dataframe.

Example:

#sample json file data
{"id":1,"attributes":{"id":10}}

spark.read.json("<json_file_path>").printSchema()
#root
# |-- attributes: struct (nullable = true)
# |    |-- id: long (nullable = true)
# |-- id: long (nullable = true)

spark.read.json("<json_file_path>").\
withColumn("attributes_id",col("attributes.id")).\
drop("attributes").\
show()
#+---+-------------+
#| id|attributes_id|
#+---+-------------+
#|  1|           10|
#+---+-------------+

#or using toDF
columns=['id','attributes_id']

spark.read.json("<json_file_path>").\
select("id","attributes.*").\
toDF(*columns).\
show()
#+---+-------------+
#| id|attributes_id|
#+---+-------------+
#|  1|           10|
#+---+-------------+

If you want to flatten dynamically then use this link answer.

answered Aug 17 '20 at 16:11

notNull

30,258
4
35
50

Thanks, which one would be considered the best method? Or is there no real difference? – fungie127 Aug 17 '20 at 16:18
I think using `.withColumn()` gives you more control over which columns to be mapped. – notNull Aug 17 '20 at 16:27
Ok thanks. Would you know how to extract a nested field with a `.` in its name? For example, I'm trying to extract the field `Dr.Web` from `attributes.categories` from the data but due to the `.` in `Dr.Web` it gets confused and thinks `Dr` is a level in the json. I get the following error: `AnalysisException: No such struct field Dr in Dr.Web` – fungie127 Aug 18 '20 at 15:04
We need to escape it.. refer this link how to escape `.` https://stackoverflow.com/questions/63341060/how-do-you-escape-dot-in-dot-syntax-e-g-in-get-json-object/63341706#63341706 – notNull Aug 18 '20 at 15:27
Any idea how to do this for pyspark, the link was for scala? – fungie127 Aug 18 '20 at 15:41

score 0 · Answer 2 · answered Aug 17 '20 at 16:13

0

After the select you can .alias("attributes_id")

answered Aug 17 '20 at 16:13

emendez

430
5
10

Spark: nested json data and duplicate column names (pyspark)

2 Answers2