5

How to parse an XML file that contains XML data within one of its columns?

In one of our projects, we receive XML files in which some of the columns store another XML document. While loading this data into a DataFrame, the inner XML gets converted to StringType (which is not intended), so we are unable to reach its nodes while querying the data (using the dot operator).
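To illustrate, here is a minimal sketch of the load step and the schema it produces, assuming spark-xml and illustrative names (file "records.xml", row tag "record"):

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("records.xml")

df.printSchema()
// root
//  |-- id: long (nullable = true)
//  |-- xml: string (nullable = true)   <-- inner XML lands as plain StringType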

I have searched extensively for answers on the net, but no luck. I found one open issue on GitHub exactly matching my use case. The link is here:

https://github.com/databricks/spark-xml/issues/140

My XML source file looks like the following:

+------+--------------------+
|  id  |               xml  |
+------+--------------------+
| 6723 |<?xml version="1....|
| 6741 |<?xml version="1....|
| 6774 |<?xml version="1....|
| 6735 |<?xml version="1....|
| 6828 |<?xml version="1....|
| 6764 |<?xml version="1....|
| 6732 |<?xml version="1....|
| 6792 |<?xml version="1....|
| 6754 |<?xml version="1....|
| 6833 |<?xml version="1....|
+------+--------------------+

In SQL Server, there is the XML datatype for storing XML within a database column, but no equivalent exists in Spark SQL.

Has anyone come across the same issue and found a workaround? If yes, please share. We're using Spark with Scala.

marc_s
Gourav Dutta
  • Hi Gourav, did you find a workaround? – devanathan May 24 '18 at 16:21
  • Please refer to this link: https://stackoverflow.com/questions/52078230/apply-a-logic-for-a-particular-column-in-dataframe-in-spark/52427120#52427120 I too had the same issue. – Rajkiran Sep 21 '18 at 13:10
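One way to parse such an XML string column directly, sketched here on the assumption of spark-xml 0.8.0 or later (which added from_xml and schema_of_xml; column and variable names are illustrative):

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._

// Infer the schema of the inner XML from the string column, then parse the
// strings into a struct column that supports dot-notation queries.
val innerSchema = schema_of_xml(df.select("xml").as[String])
val parsed = df.withColumn("parsedXml", from_xml($"xml", innerSchema))

parsed.printSchema()  // parsedXml is now a struct, not a plain string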

2 Answers

1

You can use something like the following:

import org.apache.spark.sql.functions.{col, split}

df.withColumn("ID", split(col("xml"), ",").getItem(1))

Here ID is the new column name; in col("xml"), xml is the existing DataFrame column name, and "," is the delimiter to split on (use whatever delimiter your data requires).

zx485
Amit Dass
1
import xml.etree.ElementTree as ET
from pyspark.sql import Row

row_counter = Row('id', 'abc')

def parser_xml(string_xml):
    # string_xml is a Row/tuple whose first field holds the inner XML string
    root = ET.fromstring(string_xml[0])
    col1 = root.find('visitor').attrib['id']
    col2 = root.find('visitor').attrib['abc']
    return row_counter(col1, col2)

# rdd is assumed to hold one XML string per row, e.g. df.select("xml").rdd
data = rdd.map(lambda string_file: parser_xml(string_file))
df_xml = spark.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
display(df_xml)  # display() is Databricks-specific; use df_xml.show() elsewhere