from pyspark.sql.functions import col

def flatten_df(nested_df):
    # Repeatedly promote struct fields to top-level columns until no
    # struct columns remain. Array columns are left as-is; they are
    # exploded to separate rows in a later step.
    while True:
        flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
        nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
        if not nested_cols:
            return nested_df
        print(nested_cols)
        # Prefix each promoted field with its parent column name so the
        # flattened names stay unique.
        nested_df = nested_df.select(
            flat_cols +
            [col("`" + nc + "`.`" + c + "`").alias((nc + '_' + c).replace(".", "_"))
             for nc in nested_cols
             for c in nested_df.select("`" + nc + "`.*").columns])

df = sqlContext.read.format("com.databricks.spark.xml") \
    .option("rowTag", "GetDocument") \
    .load("/FileStore/tables/test.xml")
df1 = flatten_df(df)

Here is the code I am using to flatten an XML document. Basically I want to take an XML document with nested elements and flatten all of it to a single row without any structured datatypes, so each value is a column. The above code works for the test cases I have done, but when I tried it on a very large XML, after a couple of rounds of flattening (in the while loop) it breaks with the following error:

'Ambiguous reference to fields StructField(_Id,StringType,true), StructField(_id,StringType,true);'

I assume this is because it is trying to create two separate columns with the same name. How can I avoid this while keeping my code generic for any XML?
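A minimal sketch of two possible workarounds, assuming the clash is purely a case difference like `_Id` vs `_id` (the `disambiguate` helper is hypothetical, not part of the code above):

# Option 1: make Spark resolve _Id and _id as distinct columns.
spark.conf.set("spark.sql.caseSensitive", "true")

# Option 2: suffix duplicate names as they are generated; the aliasing
# step in flatten_df would run the generated names through this helper.
def disambiguate(names):
    seen = {}
    out = []
    for n in names:
        key = n.lower()
        seen[key] = seen.get(key, 0) + 1
        # First occurrence keeps its name; repeats get _2, _3, ...
        out.append(n if seen[key] == 1 else n + '_' + str(seen[key]))
    return out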

One thing to note: it is okay to have arrays as a datatype for a column, as I will be exploding those arrays to separate rows in a later step (a sketch of that step follows the example schemas below).

Update: example schemas

Original DF -

 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)

DF after function -

 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children_id: string (nullable = true)
 |-- children_att: array (nullable = true)
 |   |-- children_att_element_Order: long (nullable = true)
 |   |-- children_att_element_attval: string (nullable = true)
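
As referenced above, the later explode step might look roughly like this (a sketch, assuming the flattened schema shown; `explode_outer` keeps rows whose array is null):

from pyspark.sql.functions import explode_outer

# Turn each element of the children_att array into its own row, then
# flatten the resulting struct column with the same function as before.
df2 = df1.withColumn("children_att", explode_outer("children_att"))
df2 = flatten_df(df2)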
– test acc
  • Are there any repeating groups? – thebluephantom Nov 06 '18 at 19:00
  • @thebluephantom Yes, but they come as arrays, which I will deal with later with an explode function. Those aren't the issue; those are coming through just fine. – test acc Nov 06 '18 at 19:13
  • https://stackoverflow.com/questions/50429315/read-xml-in-spark see this, do you need to do so much work? – thebluephantom Nov 06 '18 at 19:26
  • @thebluephantom I updated my post with an example. All structs should be removed. AKA "flattening" the row so nothing is nested. The ONLY nested items should be arrays with multiple values, which I will later explode to multiple rows. No values should be nested (AKA all `struct`s are gone) – test acc Nov 06 '18 at 19:35
  • Provide an example. You do not want multiple rows but more columns I think. – thebluephantom Nov 06 '18 at 21:13
  • @thebluephantom Look at the example in my post. You are correct, I want more columns not rows. (I will be exploding arrays to multiple rows, but that is outside the scope of this question, for now, we will be leaving array column-types as is) – test acc Nov 06 '18 at 21:31
  • Ok, worth the effort? – thebluephantom Nov 06 '18 at 21:55
  • @thebluephantom Sorry, what do you mean "worth the effort"? – test acc Nov 06 '18 at 22:01
  • Seems like a lot of work required – thebluephantom Nov 06 '18 at 22:03
  • @thebluephantom I mean yes, but that's the requirement; it's outside my pay grade to decide whether it's worth the effort. The code I have is working except for this specific new XML I am working with, which throws the error shown. – test acc Nov 06 '18 at 22:11
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/183215/discussion-between-thebluephantom-and-test-acc). – thebluephantom Nov 06 '18 at 22:17

1 Answer


I was facing a similar issue and was able to parse my XML file as follows:

  1. Install the following Maven library on Databricks: "com.databricks:spark-xml_2.10:0.4.1"
  2. Upload your file to DBFS under the following path: FileStore > tables > xml > sample_data
  3. Run the following code. Note that repeated `rowTag` options override one another, so only the last one takes effect; the read below keeps a single `rowTag`:

    data = spark.read.format("com.databricks.spark.xml") \
        .option("rootTag", "col1") \
        .option("rowTag", "col3") \
        .load("dbfs:/FileStore/tables/sample_data.xml")

    display(data)
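
The resulting DataFrame can then be run through the `flatten_df` function from the question (a sketch, assuming that function is defined in the same notebook):

    flat = flatten_df(data)
    display(flat)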

– blue_skies