3

I used Databricks spark-xml package to read a XML file into Spark. The file hast the following datastructure:

<lib>
 <element>
  <genre>Thriller</genre>
  <dates>
   <date>2000-10-01</date>
   <date>2020-10-01</date>
  </dates>
 </element>
 <element>
  <genre>SciFi</genre>
  <dates>
   <date>2015-10-01</date>
  </dates>
 </element>
</lib>    

After I load the data i get a dataframe that looks like:

root
 |-- genre: string (nullable = true)
 |-- publish_dates: struct (nullable = true)
 |    |-- publish_date: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

And "show" produces the following result:

gerne | dates
Th... | [WrappedArray(20...

Is it possible to get this result in the form:

gerne | date
Th... | 2000-...
Th... | 2020-...
Sci.. | 2015-...

I already tried:

val rdd = df.select("genre", "dates").rdd.map(row => (row(0), row(1))).flatMapValues(_.toString)

and

val rdd = df.select("genre", "dates").rdd.map(row => (row(0), Array(row(1)).toList)).flatMap {
  case (label, list) => {
    list.map( (_, label))
  }
}

and

val rdd = df.select("genre", "dates").withColumn("genre", explode($"dates"))

But nothing gets me the correct result, in fact the last version does not work at all, but i fould it via google.

Any help or tipps would be appreciated.

JanusJato
  • 33
  • 1
  • 4

1 Answers1

12

You mentioned your schema to be

root
 |-- genre: string (nullable = true)
 |-- publish_dates: struct (nullable = true)
 |    |-- publish_date: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

Which suggest that your dataframe to be

+--------+--------------------------------------+
|genre   |publish_dates                         |
+--------+--------------------------------------+
|thriller|[WrappedArray(2000-10-01, 2020-10-01)]|
+--------+--------------------------------------+

If this is the case then following explode function should give you the desired output

val finaldf = df.select("genre", "publish_dates").withColumn("publish_dates", explode(col("publish_dates.publish_date")))
finaldf.show(false)

Whose output is

+--------+-------------+
|genre   |publish_dates|
+--------+-------------+
|thriller|2000-10-01   |
|thriller|2020-10-01   |
+--------+-------------+
Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97