Databricks has released new version to read xml to Spark DataFrame
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.12</artifactId>
<version>0.6.0</version>
</dependency>
Input XML file I used on this example is available at GitHub repository.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.xml("persons.xml")
Schema
root
|-- _id: long (nullable = true)
|-- dob_month: long (nullable = true)
|-- dob_year: long (nullable = true)
|-- firstname: string (nullable = true)
|-- gender: string (nullable = true)
|-- lastname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- salary: struct (nullable = true)
| |-- _VALUE: long (nullable = true)
| |-- _currency: string (nullable = true)
Outputs:
+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename| salary|
+---+---------+--------+---------+------+--------+----------+---------------+
| 1| 1| 1980| James| M| Smith| null| [10000, Euro]|
| 2| 6| 1990| Michael| M| null| Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
Note that Spark XML API has some limitations and discussed here Spark-XML API Limitations
Hope this helps !!