2

I need to know how to parse an XML file in Spark. I am receiving streaming data from Kafka and then need to parse that streamed data.

Here is my Spark code to receive data:

directKafkaStream.foreachRDD(rdd -> {
    rdd.foreach(s -> {
        // s is a (key, value) tuple; s._2 is the message value
        System.out.println("&&&&&&&&&&&&&&&&&" + s._2);
    });
});

And results:

<root>
<student>
<name>john</name>
<marks>90</marks>
</student>
</root>

How can I parse these XML elements?

SiHa
user6325753
  • Have you searched for previous questions on this? Such as: http://stackoverflow.com/questions/33078221/xml-processing-in-spark – Binary Nerd Sep 26 '16 at 07:11
  • @Binary Nerd, thanks for the response. My Spark application is reading data line by line, so I need to parse line by line without using a start element and/or end element. – user6325753 Sep 26 '16 at 08:43

2 Answers

3

Thanks, guys. Problem solved. Here is the solution:

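// DOMParser is assumed here to be the Apache Xerces class
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

String xml = "<name>xyz</name>";
DOMParser parser = new DOMParser();
try {
    // Parse the XML held in the string rather than reading from a file
    parser.parse(new InputSource(new java.io.StringReader(xml)));
    Document doc = parser.getDocument();
    // Text content of the root element, "xyz" for the input above
    String message = doc.getDocumentElement().getTextContent();
    System.out.println(message);
} catch (Exception e) {
    // handle SAXException (malformed XML) and IOException
}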
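
For the full <student> record shown in the question, the same idea can be sketched with the DOM parser that ships with the JDK, so no extra dependency is needed. This is only a minimal sketch: the class name StudentXmlParser and the method parseStudents are made up for illustration, and it assumes each message value holds one complete, well-formed XML document (a fragment split across lines would have to be reassembled first).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Spark or of the original answer.
public class StudentXmlParser {

    // Returns one "name=marks" string per <student> element in the XML.
    public static List<String> parseStudents(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Parse from the in-memory string instead of a file
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList students = doc.getElementsByTagName("student");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < students.getLength(); i++) {
                Element student = (Element) students.item(i);
                // Assumes every <student> has exactly one <name> and one <marks>
                String name = student.getElementsByTagName("name")
                        .item(0).getTextContent();
                String marks = student.getElementsByTagName("marks")
                        .item(0).getTextContent();
                result.add(name + "=" + marks);
            }
            return result;
        } catch (Exception e) {
            // Covers SAXException (malformed XML), IOException and
            // ParserConfigurationException
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><student><name>john</name>"
                + "<marks>90</marks></student></root>";
        System.out.println(parseStudents(xml)); // prints [john=90]
    }
}
```

Inside the streaming job, a helper like this could be called on s._2 within rdd.foreach in place of the println.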
Binary Nerd
user6325753
  • @MasudRahman, please look at the mentioned link https://stackoverflow.com/questions/33078221/xml-processing-in-spark/40653300#40653300 – user6325753 Dec 05 '17 at 15:55
2

Since you are processing streaming data, it would be helpful to use Databricks' spark-xml library for XML data processing.

Reference: https://github.com/databricks/spark-xml

Amit Kulkarni
  • Thanks for the response. My Spark application is reading data line by line, so I need to parse line by line without using a start element and/or end element. – user6325753 Sep 26 '16 at 08:42
  • I spent a couple of hours with this, and then I found it does not read self-closing rows. – Masud Rahman Dec 05 '17 at 23:56