from pyspark.sql.functions import col

def flatten_df(nested_df):
    # Repeatedly promote top-level struct fields to columns until no
    # struct columns remain. Array columns are left alone on purpose.
    while True:
        flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
        nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
        if not nested_cols:
            return nested_df
        print(nested_cols)
        # Each struct field becomes a top-level column, prefixed with
        # its parent column name so the lineage stays in the name.
        nested_df = nested_df.select(
            flat_cols +
            [col("`" + nc + "`.`" + c + "`").alias((nc + '_' + c).replace(".", "_"))
             for nc in nested_cols
             for c in nested_df.select("`" + nc + "`.*").columns])

df = sqlContext.read.format("com.databricks.spark.xml") \
    .option("rowTag", "GetDocument") \
    .load("/FileStore/tables/test.xml")
df1 = flatten_df(df)
Here is the code I am using to flatten an XML document. Basically, I want to take an XML file with nested elements and flatten all of it to a single row with no struct datatypes, so that each value becomes its own column. The code above works for the test cases I have tried, but when I run it on a very large XML, it breaks after a couple of rounds of flattening (in the while loop) with the following error:
'Ambiguous reference to fields StructField(_Id,StringType,true), StructField(_id,StringType,true);'
I assume this is because it is trying to create two separate columns with the same name. How can I avoid this while keeping my code generic for any XML?
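One direction I have been looking at, since the two fields differ only by case (_Id vs _id), is turning on Spark's case-sensitive column resolution before flattening. A minimal sketch, assuming a SparkSession named spark (on Databricks this exists alongside sqlContext); spark.sql.caseSensitive is a real Spark SQL config, and the tiny DataFrame is made up to mirror my error:

from pyspark.sql import Row

# With case-sensitive resolution on, _Id and _id are distinct fields
# and the "Ambiguous reference" analysis error should go away.
spark.conf.set("spark.sql.caseSensitive", "true")

sample = spark.createDataFrame([Row(doc=Row(_Id="A", _id="a"))])
sample.select("doc.*").show()  # two columns: _Id and _id

I am not sure whether this is the right generic fix, or whether I should instead deduplicate the generated aliases inside flatten_df.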
One thing to note: it is okay to have arrays as a datatype for a column, since I will be exploding those arrays into separate rows in a later step, as sketched below.
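For reference, a minimal sketch of what that later explode step might look like, run against the flattened frame df1 from above (explode_outer is the standard pyspark.sql.functions call; the column name is just the one from the example schemas below):

from pyspark.sql.functions import explode_outer

# Turn each element of the children_att array into its own row;
# explode_outer keeps rows whose array is null or empty.
exploded = df1.withColumn("children_att", explode_outer("children_att"))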
Update: example schemas
Original DF -
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- att: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Order: long (nullable = true)
| | | |-- attval: string (nullable = true)
DF after function -
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children_id: string (nullable = true)
|-- children_att: array (nullable = true)
| |-- children_att_element_Order: long (nullable = true)
| |-- children_att_element_attval: string (nullable = true)
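To make the update concrete, here is a small sketch that hand-builds a DataFrame with the original schema above and runs flatten_df over it (assumes a SparkSession named spark; the literal values are invented for illustration):

from pyspark.sql import Row

# Mirrors the "Original DF" schema (field order may differ slightly).
demo = spark.createDataFrame([
    Row(Order=1, attval="x",
        children=Row(id="c1", att=[Row(Order=2, attval="y")]))
])
demo.printSchema()
flatten_df(demo).printSchema()  # children.id / children.att become top-level columns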