
I have an Excel sheet where one column contains XML (each row is a different XML document). I am trying to use PySpark and spark-xml to parse these by doing df = spark.read.format('xml').options(rowTag = 'book').load(___).

The load works fine when you specify an XML file, but is it possible to read in the Excel sheet and loop over those XMLs to parse them, without having to convert each one to its own XML file?

formicaman
  • See if this helps https://stackoverflow.com/questions/44196741/how-to-construct-dataframe-from-a-excel-xls-xlsx-file-in-scala-spark – Shubham Jain Jun 15 '20 at 14:16
  • Thanks. That just turns it into a df, though. I don't think that would work with spark-xml? – formicaman Jun 15 '20 at 14:23
  • Read as dataframe then for each column parse the xml.. or if your file is not bigger than 2 gb then use normal xml parser and pandas as that will be faster – Shubham Jain Jun 15 '20 at 14:25
  • I have been using lxml/ElementTree, but the "load(__)" of spark-xml only seems to work if it gets passed an .xml file. – formicaman Jun 15 '20 at 14:32

1 Answer


You need to first read the Excel file into a Spark DataFrame. You can then parse the String column containing XML with spark-xml's from_xml function, which gives you a new column holding the structured data parsed from each XML document.
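As the comments note, if the workbook fits comfortably in memory you can also skip Spark entirely: load the sheet with pandas and parse each XML string with a standard parser. Below is a minimal sketch of that route. The column name `xml`, the `book` root, and the `title`/`author` child elements are hypothetical, purely for illustration; in practice the frame would come from something like `pd.read_excel("books.xlsx")` rather than being built inline.

```python
import pandas as pd
import xml.etree.ElementTree as ET

# Stand-in for the Excel sheet; in practice this would come from
# pd.read_excel(...). Each row holds one XML document as a string.
df = pd.DataFrame({
    "xml": [
        "<book><title>Dune</title><author>Herbert</author></book>",
        "<book><title>Emma</title><author>Austen</author></book>",
    ]
})

def parse_book(xml_string):
    """Parse one XML document and pull out the fields of interest."""
    root = ET.fromstring(xml_string)
    return {
        "title": root.findtext("title"),
        "author": root.findtext("author"),
    }

# Parse every row, then expand the dicts into new columns
# alongside the original XML column.
parsed = df["xml"].apply(parse_book).apply(pd.Series)
result = pd.concat([df, parsed], axis=1)
print(result[["title", "author"]])
```

The same per-row parsing logic could also be wrapped in a Spark UDF if the data is too large for pandas, though the spark-xml from_xml route above avoids hand-writing the extraction.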

Sean Owen