2

I have xml file having multiple rowstags. i need to convert this xml to proper dataframe. i have used spark-xml which is only handling single row tag.

xml data is below

<?xml version='1.0' encoding='UTF-8' ?>
<generic
    xmlns="http://xactware.com/generic.xsd" majorVersion="28" minorVersion="300" transactionId="0000">
    <HEADER compName="ABGROUP" dateCreated="2018-03-09T09:38:51"/>
    <COVERSHEET>
        <ESTIMATE_INFO estimateName="2016-09-28-133907" priceList="YHTRDF" laborEff="Restoration/Service/Remodel" claimNumber="Hdchtdhtdh" policyNumber="Utfhtdhtd" typeOfLoss="Collapse" causeOfLoss="Collapse" roofDamage="0" deprMat="1" deprNonMat="1" deprRemoval="1" deprOandP="1" deprTaxes="1" estimateType="Mixed"/>
        <ADDRESSES>
            <ADDRESS type="Property" street="Pkwy" city="Lehi" state="UT" zip="0000" primary="1"/>
        </ADDRESSES>
        <CONTACTS>
            <CONTACT type="ClaimRep" name="Vytvyfv"/>
            <CONTACT type="Estimator" name="Vytvyfv"/>
        </CONTACTS>
        <DATES loss="2016-09-28T19:38:23Z" inspected="2016-09-28T19:39:27Z" completed="2018-03-09T09:38:49Z" received="2016-09-28T19:39:24Z" entered="2016-09-28T19:39:07Z" contacted="2016-09-28T19:39:26Z"/>
    </COVERSHEET>
    <COVERAGES>
        <COVERAGE coverageName="Dwelling" coverageType="0" id="1"/>
        <COVERAGE coverageName="Other Structures" coverageType="1" id="2"/>
        <COVERAGE coverageName="Contents" coverageType="2" id="3"/>
    </COVERAGES>
    <LINE_ITEM_DETAIL>
        <COV_BREAKDOWN>
            <COV_AMOUNTS desc="Dwelling"/>
            <COV_AMOUNTS desc="Other Structures"/>
            <COV_AMOUNTS desc="Contents"/>
        </COV_BREAKDOWN>
    </LINE_ITEM_DETAIL>
    <RECAP_BY_ROOM>
        <RECAP_GROUP desc="2016-09-28-133907"/>
    </RECAP_BY_ROOM>
</generic>
LUZO
  • 1,019
  • 4
  • 19
  • 42

1 Answers1

4

I would suggest you to read it as one rowTag (generic element) and later explode according to your needs

First of all, attributes of the elements should not contain line delimiter so

<generic
    xmlns="http://xactware.com/generic.xsd" majorVersion="28" minorVersion="300" transactionId="0000">

should be

<generic xmlns="http://xactware.com/generic.xsd" majorVersion="28" minorVersion="300" transactionId="0000">

Once above amendment is done, you can read it using databricks xml as

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "generic") \
    .option("valueTag", False) \
    .load("path to xml file")

which should give you

+-------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------------------------------------------------+----------------------+-------------+-------------+--------------+-------------------------------+
|COVERAGES                                                          |COVERSHEET                                                                                                                                                                                                                                                                                                                                                  |HEADER                         |LINE_ITEM_DETAIL                                   |RECAP_BY_ROOM         |_majorVersion|_minorVersion|_transactionId|_xmlns                         |
+-------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------------------------------------------------+----------------------+-------------+-------------+--------------+-------------------------------+
|[[[Dwelling, 0, 1,], [Other Structures, 1, 2,], [Contents, 2, 3,]]]|[[[Lehi, 1, UT, Pkwy, Property, 0,]], [[[Vytvyfv, ClaimRep,], [Vytvyfv, Estimator,]]], [2018-03-09T09:38:49Z, 2016-09-28T19:39:26Z, 2016-09-28T19:39:07Z, 2016-09-28T19:39:27Z, 2016-09-28T19:38:23Z, 2016-09-28T19:39:24Z,], [Collapse, Hdchtdhtdh, 1, 1, 1, 1, 1, 2016-09-28-133907, Mixed, Restoration/Service/Remodel, Utfhtdhtd, YHTRDF, 0, Collapse,]]|[ABGROUP, 2018-03-09T09:38:51,]|[[[[Dwelling,], [Other Structures,], [Contents,]]]]|[[2016-09-28-133907,]]|28           |300          |0             |http://xactware.com/generic.xsd|
+-------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------------------------------------------------+----------------------+-------------+-------------+--------------+-------------------------------+

root
 |-- COVERAGES: struct (nullable = true)
 |    |-- COVERAGE: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _coverageName: string (nullable = true)
 |    |    |    |-- _coverageType: long (nullable = true)
 |    |    |    |-- _id: long (nullable = true)
 |    |    |    |-- false: string (nullable = true)
 |-- COVERSHEET: struct (nullable = true)
 |    |-- ADDRESSES: struct (nullable = true)
 |    |    |-- ADDRESS: struct (nullable = true)
 |    |    |    |-- _city: string (nullable = true)
 |    |    |    |-- _primary: long (nullable = true)
 |    |    |    |-- _state: string (nullable = true)
 |    |    |    |-- _street: string (nullable = true)
 |    |    |    |-- _type: string (nullable = true)
 |    |    |    |-- _zip: long (nullable = true)
 |    |    |    |-- false: string (nullable = true)
 |    |-- CONTACTS: struct (nullable = true)
 |    |    |-- CONTACT: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- _name: string (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)
 |    |    |    |    |-- false: string (nullable = true)
 |    |-- DATES: struct (nullable = true)
 |    |    |-- _completed: string (nullable = true)
 |    |    |-- _contacted: string (nullable = true)
 |    |    |-- _entered: string (nullable = true)
 |    |    |-- _inspected: string (nullable = true)
 |    |    |-- _loss: string (nullable = true)
 |    |    |-- _received: string (nullable = true)
 |    |    |-- false: string (nullable = true)
 |    |-- ESTIMATE_INFO: struct (nullable = true)
 |    |    |-- _causeOfLoss: string (nullable = true)
 |    |    |-- _claimNumber: string (nullable = true)
 |    |    |-- _deprMat: long (nullable = true)
 |    |    |-- _deprNonMat: long (nullable = true)
 |    |    |-- _deprOandP: long (nullable = true)
 |    |    |-- _deprRemoval: long (nullable = true)
 |    |    |-- _deprTaxes: long (nullable = true)
 |    |    |-- _estimateName: string (nullable = true)
 |    |    |-- _estimateType: string (nullable = true)
 |    |    |-- _laborEff: string (nullable = true)
 |    |    |-- _policyNumber: string (nullable = true)
 |    |    |-- _priceList: string (nullable = true)
 |    |    |-- _roofDamage: long (nullable = true)
 |    |    |-- _typeOfLoss: string (nullable = true)
 |    |    |-- false: string (nullable = true)
 |-- HEADER: struct (nullable = true)
 |    |-- _compName: string (nullable = true)
 |    |-- _dateCreated: string (nullable = true)
 |    |-- false: string (nullable = true)
 |-- LINE_ITEM_DETAIL: struct (nullable = true)
 |    |-- COV_BREAKDOWN: struct (nullable = true)
 |    |    |-- COV_AMOUNTS: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- _desc: string (nullable = true)
 |    |    |    |    |-- false: string (nullable = true)
 |-- RECAP_BY_ROOM: struct (nullable = true)
 |    |-- RECAP_GROUP: struct (nullable = true)
 |    |    |-- _desc: string (nullable = true)
 |    |    |-- false: string (nullable = true)
 |-- _majorVersion: long (nullable = true)
 |-- _minorVersion: long (nullable = true)
 |-- _transactionId: long (nullable = true)
 |-- _xmlns: string (nullable = true)

Inspecting the above dataframe, you can simplify it by doing the following

from pyspark.sql import functions as f
df.select(f.col('COVERAGES.COVERAGE'), f.col('COVERSHEET.ADDRESSES.ADDRESS.*'), f.col('COVERSHEET.CONTACTS.CONTACT'), f.col('COVERSHEET.DATES.*'), f.col('COVERSHEET.ESTIMATE_INFO.*'), f.col('HEADER.*'), f.col('LINE_ITEM_DETAIL.COV_BREAKDOWN.COV_AMOUNTS'), f.col('RECAP_BY_ROOM.RECAP_GROUP.*'), f.col('_majorVersion'), f.col('_minorVersion'), f.col('_transactionId'), f.col('_xmlns')).show(truncate=False)

which should give you dataframe with schema as below

+-----------------------------------------------------------------+-----+--------+------+-------+--------+----+-----+---------------------------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+------------+------------+--------+-----------+----------+------------+----------+-----------------+-------------+---------------------------+-------------+----------+-----------+-----------+-----+---------+-------------------+-----+-----------------------------------------------+-----------------+-----+-------------+-------------+--------------+-------------------------------+
|COVERAGE                                                         |_city|_primary|_state|_street|_type   |_zip|false|CONTACT                                      |_completed          |_contacted          |_entered            |_inspected          |_loss               |_received           |false|_causeOfLoss|_claimNumber|_deprMat|_deprNonMat|_deprOandP|_deprRemoval|_deprTaxes|_estimateName    |_estimateType|_laborEff                  |_policyNumber|_priceList|_roofDamage|_typeOfLoss|false|_compName|_dateCreated       |false|COV_AMOUNTS                                    |_desc            |false|_majorVersion|_minorVersion|_transactionId|_xmlns                         |
+-----------------------------------------------------------------+-----+--------+------+-------+--------+----+-----+---------------------------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+------------+------------+--------+-----------+----------+------------+----------+-----------------+-------------+---------------------------+-------------+----------+-----------+-----------+-----+---------+-------------------+-----+-----------------------------------------------+-----------------+-----+-------------+-------------+--------------+-------------------------------+
|[[Dwelling, 0, 1,], [Other Structures, 1, 2,], [Contents, 2, 3,]]|Lehi |1       |UT    |Pkwy   |Property|0   |null |[[Vytvyfv, ClaimRep,], [Vytvyfv, Estimator,]]|2018-03-09T09:38:49Z|2016-09-28T19:39:26Z|2016-09-28T19:39:07Z|2016-09-28T19:39:27Z|2016-09-28T19:38:23Z|2016-09-28T19:39:24Z|null |Collapse    |Hdchtdhtdh  |1       |1          |1         |1           |1         |2016-09-28-133907|Mixed        |Restoration/Service/Remodel|Utfhtdhtd    |YHTRDF    |0          |Collapse   |null |ABGROUP  |2018-03-09T09:38:51|null |[[Dwelling,], [Other Structures,], [Contents,]]|2016-09-28-133907|null |28           |300          |0             |http://xactware.com/generic.xsd|
+-----------------------------------------------------------------+-----+--------+------+-------+--------+----+-----+---------------------------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+------------+------------+--------+-----------+----------+------------+----------+-----------------+-------------+---------------------------+-------------+----------+-----------+-----------+-----+---------+-------------------+-----+-----------------------------------------------+-----------------+-----+-------------+-------------+--------------+-------------------------------+

root
 |-- COVERAGE: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _coverageName: string (nullable = true)
 |    |    |-- _coverageType: long (nullable = true)
 |    |    |-- _id: long (nullable = true)
 |    |    |-- false: string (nullable = true)
 |-- _city: string (nullable = true)
 |-- _primary: long (nullable = true)
 |-- _state: string (nullable = true)
 |-- _street: string (nullable = true)
 |-- _type: string (nullable = true)
 |-- _zip: long (nullable = true)
 |-- false: string (nullable = true)
 |-- CONTACT: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _name: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |    |    |-- false: string (nullable = true)
 |-- _completed: string (nullable = true)
 |-- _contacted: string (nullable = true)
 |-- _entered: string (nullable = true)
 |-- _inspected: string (nullable = true)
 |-- _loss: string (nullable = true)
 |-- _received: string (nullable = true)
 |-- false: string (nullable = true)
 |-- _causeOfLoss: string (nullable = true)
 |-- _claimNumber: string (nullable = true)
 |-- _deprMat: long (nullable = true)
 |-- _deprNonMat: long (nullable = true)
 |-- _deprOandP: long (nullable = true)
 |-- _deprRemoval: long (nullable = true)
 |-- _deprTaxes: long (nullable = true)
 |-- _estimateName: string (nullable = true)
 |-- _estimateType: string (nullable = true)
 |-- _laborEff: string (nullable = true)
 |-- _policyNumber: string (nullable = true)
 |-- _priceList: string (nullable = true)
 |-- _roofDamage: long (nullable = true)
 |-- _typeOfLoss: string (nullable = true)
 |-- false: string (nullable = true)
 |-- _compName: string (nullable = true)
 |-- _dateCreated: string (nullable = true)
 |-- false: string (nullable = true)
 |-- COV_AMOUNTS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _desc: string (nullable = true)
 |    |    |-- false: string (nullable = true)
 |-- _desc: string (nullable = true)
 |-- false: string (nullable = true)
 |-- _majorVersion: long (nullable = true)
 |-- _minorVersion: long (nullable = true)
 |-- _transactionId: long (nullable = true)
 |-- _xmlns: string (nullable = true)

Now you can transform it into multiple rows depending to COVERAGE or CONTACT or COV_AMOUNTS columns as they are the only columns that can be exploded to multiple rows.

I hope the answer is helpful

Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
  • great ramesh, but i am not able to read it. it is throwing me, emty dataframe. i have removed line delimiter manually in xml file. have you used any process to remove? – LUZO May 20 '18 at 14:18
  • I removed manually too. try part by part to track down the bug. I mean you can try the first part of reading only . – Ramesh Maharjan May 20 '18 at 14:24
  • resolved the bug. two things, what valueTag will do? and why we are using 'df.select' followed by columns ? what exactly happening there? – LUZO May 20 '18 at 14:39
  • i am copy pasting it `valueTag: The tag used for the value when there are attributes in the element having no child. Default is _VALUE.` so it was creating empty values so I put false for that. df.select columns is just flattening the struct columns . Array columns cannot be flatterned that easily so I left them otherwise all the struct columns are flattened – Ramesh Maharjan May 20 '18 at 15:00