I used the Databricks spark-xml package to read an XML file into Spark. The file has the following data structure:
<lib>
<element>
<genre>Thriller</genre>
<dates>
<date>2000-10-01</date>
<date>2020-10-01</date>
</dates>
</element>
<element>
<genre>SciFi</genre>
<dates>
<date>2015-10-01</date>
</dates>
</element>
</lib>
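For reference, this is roughly how I load the file (the path is a placeholder and the exact options are from memory; I point spark-xml's rowTag at the element nodes):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("xml-example").getOrCreate()

// Read each <element> node as one row; "lib.xml" stands in for the real path.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "element")
  .load("lib.xml")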
After loading the data I get a DataFrame with the following schema:
root
 |-- genre: string (nullable = true)
 |-- dates: struct (nullable = true)
 |    |-- date: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
And "show" produces the following result:
genre | dates
Th... | [WrappedArray(20...
Is it possible to get this result in the form:
genre | date
Th... | 2000-...
Th... | 2020-...
Sci.. | 2015-...
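Just to make the intended flattening explicit, this is what I mean in terms of plain Scala collections (hypothetical in-memory data, not the actual DataFrame):

val lib = Seq(
  ("Thriller", Seq("2000-10-01", "2020-10-01")),
  ("SciFi",    Seq("2015-10-01"))
)
// Each (genre, dates) pair should become one row per date.
val flattened = lib.flatMap { case (genre, dates) => dates.map(date => (genre, date)) }
// flattened == Seq(("Thriller", "2000-10-01"), ("Thriller", "2020-10-01"), ("SciFi", "2015-10-01"))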
I already tried:
val rdd = df.select("genre", "dates").rdd.map(row => (row(0), row(1))).flatMapValues(_.toString)
and
val rdd = df.select("genre", "dates").rdd.map(row => (row(0), Array(row(1)).toList)).flatMap {
  case (label, list) => {
    list.map((_, label))
  }
}
and
val rdd = df.select("genre", "dates").withColumn("genre", explode($"dates"))
But none of these give me the correct result; in fact, the last version does not work at all (I found it via Google).
Any help or tips would be appreciated.