
I'm trying to read a SAP ABAP XML file via Spark using the Databricks Spark-XML jar.

My problem is that the output DataFrame schema is sorted alphabetically by default; I want to maintain the order of the XML schema.

XML file:

<?xml version="1.0" encoding="utf-16"?><asx:abap xmlns:asx="http://www.sap.com/abapxml" version="1.0"><asx:values><TAB><item>...

Spark Dataframe:

df = spark.read.format('com.databricks.spark.xml')\
     .option('rowTag', 'item')\
     .option('encoding', 'UTF-16')\
     .load("path/to/file.xml")

Result:

df.printSchema()

root
 |-- AEDAT: string (nullable = true)
 |-- ASTNR: long (nullable = true)
 |-- BWD: long (nullable = true)
...

Is there any option to not sort the result?

Thanks!

Oli
Cir02

2 Answers


No, although you can always df.select("thing_I_want_first", "thing_I_want_second"); this requires you to know the order in which the fields appear in the XML.

(What if they don't appear in the same order in the XML, though? It would be ambiguous anyway. There is not much meaning to the ordering of columns in a DataFrame either.)
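
A minimal PySpark sketch of that workaround; the column names below are just the ones visible in the question's printSchema output and the order is only illustrative:

# Reorder the columns explicitly after reading. The list must match the
# order the fields have in the XML (which you need to know up front).
df_ordered = df.select("ASTNR", "AEDAT", "BWD")
df_ordered.printSchema()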

Sean Owen
  • Hi, thanks for your answer. This XML is a SAP ODP extraction via an RFC function, and I want to maintain the same schema order if possible. If the output DataFrame schema with Spark-XML is always sorted alphabetically by default, I will try to find another solution. With `pandas.read_xml` the schema is not changed. – Cir02 Dec 15 '22 at 07:39

You can set the order of the elements by defining the schema yourself. If you have an XSD, you can try to create a schema from it and then read the XML into a DataFrame using that schema: https://github.com/databricks/spark-xml

import com.databricks.spark.xml.util.XSDToSchema
import java.nio.file.Paths

// Build a Spark schema from the XSD; the field order follows the XSD.
val schema = XSDToSchema.read(Paths.get("/path/to/your.xsd"))
val df = spark.read.schema(schema)....xml(...)
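
If you don't have an XSD, a rough PySpark alternative (my sketch, not part of the library's XSD tooling) is to declare the schema by hand with StructType and pass it to the reader; spark-xml keeps the field order of a user-supplied schema. The field names and types below are only guesses based on the printSchema output in the question:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Field order in the StructType is preserved in the resulting DataFrame.
# Names and types are illustrative -- take them from your actual XML.
item_schema = StructType([
    StructField("ASTNR", LongType(), True),
    StructField("AEDAT", StringType(), True),
    StructField("BWD", LongType(), True),
    # ... remaining fields, in the order they appear in the XML
])

df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "item") \
    .option("encoding", "UTF-16") \
    .schema(item_schema) \
    .load("path/to/file.xml")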

The other option, as you mentioned in the comments, would be to use pandas.read_xml.
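
A minimal pandas sketch of that route (the path is a placeholder, and the .//item XPath assumes the item elements are not inside the asx namespace, which matches the snippet in the question):

import pandas as pd

# pandas.read_xml keeps the element order from the document.
pdf = pd.read_xml("path/to/file.xml", xpath=".//item", encoding="utf-16")

# If a Spark DataFrame is still needed, the column order is preserved here too.
sdf = spark.createDataFrame(pdf)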

banest