How to read XML data from a Spark DataFrame column

Question

I have a Spark DataFrame with the columns value, key, and others. The value column holds XML as a string.

Now I would like to create a new DataFrame in which the XML content of the value column is parsed as if I were reading it with spark.read.xml, with the other columns, such as key, appended to the new DataFrame.

Is this possible?

I generally read XML files like this:

dfx = spark.read.load('books.xml', format='xml', rowTag='bks:books', valueTag="_ele_value")
dfx.schema

I am trying to get similar DataFrame output when reading the XML from the value column (the data comes from Kafka).

My XML has a deeply nested structure. Here is just an example, a books XML with two levels of nesting:

<?xml version="1.0" encoding="UTF-8"?>
<bks:books xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:bks="urn:books"
           xsi:schemaLocation="urn:books books.xsd" xmlns:ot="http://maven.apache.org/POM/4.0.0">
    <book id="b001">
        <author>Brandon Sanderson</author>
        <title>Mistborn</title>
        <genre sub='epic'>Fantasy</genre>
        <price>50</price>
        <pub_date>2006-12-17T09:30:47.0Z</pub_date>
        <review>
            <title>Wonderful</title>
            <content>I love the plot twist and the new magic</content>
        </review>
        <review>
            <title>Unbelievable twist</title>
            <content>The best book i ever read</content>
        </review>
        <sold>10</sold>
    </book>
    <book id="b002">
        <author>Brandon Sanderson</author>
        <title>Way of Kings</title>
        <genre sub='epic'>Fantasy</genre>
        <price>50</price>
        <pub_date>2006-12-17T09:30:47.0Z</pub_date>
        <sold>10</sold>
    </book>
</bks:books>

This might be of some help https://stackoverflow.com/questions/40445816/load-xml-string-from-column-in-pyspark — DataWrangler, Jan 07 '20 at 13:57
The answer given there doesn't seem to support my use case. In that example I have to extract the fields I need, but I want the whole XML to be converted to a nested structure — Geethanadh, Jan 07 '20 at 18:54
Does this answer your question? [Read XML in spark](https://stackoverflow.com/questions/50429315/read-xml-in-spark) — vaquar khan, Jan 08 '20 at 16:11

score 0 · Accepted Answer · answered Jan 08 '20 at 15:04

It looks like this can be achieved using XmlReader (but only in Scala):

import com.databricks.spark.xml.XmlReader
import org.apache.spark.rdd.RDD
import spark.implicits._  // needed for .as[String]

// Pull the XML strings out of the value column
val rdd: RDD[String] = df.select("value").as[String].rdd

// No schema is passed, so spark-xml infers it from the XML itself
val new_df = new XmlReader().withRowTag("bks:books").withValueTag("_ele_value").xmlRdd(spark, rdd)

The problem with this approach is that we lose any relationship between value and the other columns in the initial DataFrame.

If anyone knows a way to link them, let me know :)
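
One possible way to keep the link, sketched here in plain Python rather than with spark-xml: parse each row's XML on its own and carry the key alongside the result. This is only an illustration under a few assumptions. The names element_to_dict and parse_rows are made up for this sketch, rows are assumed to be (key, xml_string) pairs, attributes get a "_" prefix and mixed text goes under "_ele_value" to mimic the options used above, and every child tag maps to a list even when it occurs once (spark-xml's inferred schema would not do that).

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an Element into nested dicts and lists.

    Attributes are stored under "_<name>"; text that sits next to
    attributes or children is stored under "_ele_value".
    """
    node = {"_" + name: value for name, value in elem.attrib.items()}
    for child in elem:
        # Every child tag maps to a list, so repeated tags such as
        # <review> need no special handling (a simplification).
        node.setdefault(child.tag, []).append(element_to_dict(child))
    text = (elem.text or "").strip()
    if not node:
        return text  # plain leaf element: just its text
    if text:
        node["_ele_value"] = text
    return node


def parse_rows(rows):
    """rows: iterable of (key, xml_string); yields (key, nested structure)."""
    for key, xml_string in rows:
        yield key, element_to_dict(ET.fromstring(xml_string))


if __name__ == "__main__":
    sample = [("k1", "<books><book id='b001'><title>Mistborn</title></book></books>")]
    for key, parsed in parse_rows(sample):
        print(key, parsed)
```

Because the key travels with each parsed value, nothing has to be joined back afterwards. In PySpark the same function could probably be applied per row (for example through an RDD map or a UDF with an explicit return schema), but that part is untested here. Note also that ElementTree expands namespace prefixes, so bks:books shows up as {urn:books}books.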
