How to parse within CDATA in XML using Java

Question

I searched through the existing CDATA discussions, but none that I found achieves what I'm attempting.

Is it possible to parse within CDATA where the tag is not unique?

Below is the XML document where I'm attempting to retrieve each field within the CDATA block that has multiple fields of interest (i.e. Data Loaded, Quality, Status, Index) on line 5 below. Each field is marked with the "li" tag within the CDATA block (even though it's a character data space):

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
<Document>
 <name>area Area Date: 2014-07-31</name>
 <Placemark><name>P07L327</name><Point><coordinates>-96.26879,85.19125</coordinates></Point><description><![CDATA[<ol><li> Data Loaded:  NO</li><li>Quality: 5</li><li>Status: UP</li><li>Index: 72</li></eol>]]></description><Style> id = "colorIcon"</Style></Placemark>
 <coordinates>-96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,45.14698,0 </coordinates>
</Document>
</kml>

Currently the output looks like this:

Name: <ol><li> Data Loaded:  NO</li><li>Quality: 5</li><li>Status: UP</li><li>Index: 72</li></eol>

From WITHIN the CDATA block, my intention is to output a new line for each field along with its appropriate result.

Below is the code that's written up until now that gives the current output listed above:

package com.lucy.seo;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.CharacterData;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
import java.io.File;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Comment;
import org.w3c.dom.Text;
import org.xml.sax.SAXException;

public class ReadXMLFile {

    public static void main(String[] args) throws Exception {

        File fXmlFile = new File("C:/XML_UltraEdit/XML_Sandbox/Oracle_Java_Project/Test_Doc.xml");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(fXmlFile);

        doc.getDocumentElement().normalize();

        System.out.println("Root element :" + doc.getDocumentElement().getNodeName());

        NodeList nList = doc.getElementsByTagName("Placemark");

        System.out.println("----------------------------");

        for (int temp = 0; temp < nList.getLength(); temp++) {
            Element element = (Element) nList.item(temp);
            NodeList name = element.getElementsByTagName("description");
            Element line = (Element) name.item(0);
            System.out.println("Name: " + getCharacterDataFromElement(line));
        }
    }

    public static String getCharacterDataFromElement(Element f) {

        NodeList list = f.getChildNodes();
        String data;

        for (int index = 0; index < list.getLength(); index++) {
            if (list.item(index) instanceof CharacterData) {
                CharacterData child = (CharacterData) list.item(index);
                data = child.getData();

                if (data != null && data.trim().length() > 0)
                    return child.getData();
            }
        }
        return "";
    }
}

Appreciate any help towards this! -- thanks!

Sep 2, 2014 update

Updated with the final solution. Thank you to everyone here who posted solutions and helped. The solution is split into two pieces of code (two files) due to library conflicts:

//First file, whose output is the input to the second file that follows

import java.io.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ReadXMLFile {

    public static void main(String[] args) throws Exception {
        PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/temp_file.html"));
        System.setOut(out);
        File fXmlFile = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/raw_input.xml");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(fXmlFile);

        //optional, but recommended
        //read this - http://stackoverflow.com/questions/13786607/normalization-in-dom-parsing-with-java-how-does-it-work
        doc.getDocumentElement().normalize();

        NodeList nList = doc.getElementsByTagName("Placemark");

        //create a buffered reader that connects to the console, we use it so we can read lines
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("<html xmlns=http://www.w3.org/1999/xhtml>");

        for (int temp = 0; temp < nList.getLength(); temp++) {
            Node nNode = nList.item(temp);
            Element eElement = (Element) nNode;

            Element element = (Element) nList.item(temp);
            NodeList name = element.getElementsByTagName("description");
            Element line = (Element) name.item(0);

            System.out.println("<bracket><li>Name: " + eElement.getElementsByTagName("name").item(0).getTextContent() + "</li>");
            System.out.println("<description>Description: " + getCharacterDataFromElement(line) + "</description></bracket>");
        }
        System.out.println("</html>");

        //read a line from the console
        String lineFromInput = in.readLine();

        //output to the file a line
        out.println(lineFromInput);
        out.close();
    }

    public static String getCharacterDataFromElement(Element f) {

        NodeList list = f.getChildNodes();
        String data;

        for (int index = 0; index < list.getLength(); index++) {
            if (list.item(index) instanceof CharacterData) {
                CharacterData child = (CharacterData) list.item(index);
                data = child.getData();

                if (data != null && data.trim().length() > 0)
                    return child.getData();
            }
        }
        return "";
    }
}


//Second File
package ReadXMLFile_part2;

import java.io.*;

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ReadXMLFile_part2 {

    public static void main(String[] args) throws Exception {

        PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/PA-PTH013_Output_Meters.xml"));
        System.setOut(out);

        System.out.println("*** JSOUP ***");

        File input = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/temp_file.html");
        Document doc = null;
        try {
            doc = Jsoup.parse(input, "UTF-8", "http://www.w3.org/1999/xhtml");
        } catch (IOException ex) {
            Logger.getLogger(ReadXMLFile_part2.class.getName()).log(Level.SEVERE, null, ex);
        }
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

        Elements brackets = doc.getElementsByTag("bracket");

        for (Element bracket : brackets) {
            Elements lis = bracket.select("li");

            for (Element li : lis) {
                System.out.println(li.text());
            }
            break;
        }
        System.out.println();

        //read a line from the console
        String lineFromInput = in.readLine();

        //output to the file a line
        out.println(lineFromInput);
        out.close();
    }

}

you can write a handler for the entire CDATA block and then do your own parsing on that, but the whole point of `CDATA` is that it's defined as plain character data that should not get parsed by an XML reader =) — Mike 'Pomax' Kamermans, Aug 12 '14 at 22:40
Thanks, but how do I parse through text with the same XML tag being the li elements? The closest I found was this: [link] http://stackoverflow.com/questions/12889253/how-to-parse-same-name-tag-in-xml-using-dom-parser-java — stitch70, Aug 13 '14 at 18:45
you already know how to get that CDATA as raw data, so the trick is to use a *second* parser (but then an HTML parser, not XML parser) and run your CDATA string through that. — Mike 'Pomax' Kamermans, Aug 13 '14 at 20:09
Thanks, I've ended up using a second parser (using HTML) as you said and has worked but ran into another issue. I've posted the full problem on another thread here: [link] (http://stackoverflow.com/questions/25491424/parsing-html-data-using-java-dom-parse) — stitch70, Aug 25 '14 at 17:42

score 3 · Answer 1 · answered Aug 12 '14 at 22:54

CDATA is a marker that tells XML parsers that whatever they encounter between its start and end delimiters should be treated as "pure" (raw) character data.

So, in a way, it's like an escape character for the parser (one that can encompass many characters).

Therefore, you won't find an XML parser that reports whatever is inside a CDATA section as XML, because the standard says that it MUST report it as a character stream. (As a consequence, it MUST NOT interpret it as an XML stream, which is actually good, because nothing requires the content to be XML at all.)
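
To make that concrete, here is a minimal sketch, assuming the JDK's built-in DOM parser with default factory settings (in particular, coalescing left off). The class name CdataIsTextDemo and its parse helper are made up for illustration and are not part of the question's code. It shows that the li tags inside the CDATA section never become DOM elements:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class CdataIsTextDemo {

    // Hypothetical helper: parses an XML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // A cut-down stand-in for the question's <description> element.
        String xml = "<description><![CDATA[<ol><li>Quality: 5</li></ol>]]></description>";
        Document doc = parse(xml);

        // The only child of <description> is a CDATA section node (character data).
        Node child = doc.getDocumentElement().getFirstChild();
        System.out.println(child instanceof CDATASection);                // true

        // No <li> elements exist in the tree; the markup is just text.
        System.out.println(doc.getElementsByTagName("li").getLength());  // 0

        // The raw markup comes back unchanged as a string.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```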

Anyway, your parser and your code are working as expected.

But if, as in your case, you happen to know that the content of a certain CDATA instance is indeed a valid XML instance, then you can open a new parser for this precise content and deal with it appropriately.

So you can take the output of your getCharacterDataFromElement(line) call, feed it to your DocumentBuilder, and use the new Document instance to read the content of your li elements.
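
A minimal sketch of that second parse, using only the JDK's DOM API. The class CdataSecondParse and the helper listItems are hypothetical names introduced here. One assumption matters: the sample string closes the list with </ol>, whereas the document in the question closes it with </eol>. A strict XML parser rejects that mismatch with a SAXParseException, so the input would need to be corrected first, or handed to a more lenient HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CdataSecondParse {

    // Hypothetical helper: parses the CDATA text as its own XML document
    // and returns the trimmed text of every <li> element, in document order.
    static List<String> listItems(String cdataText) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document inner = builder.parse(new InputSource(new StringReader(cdataText)));

        // Same pattern as the loop over "Placemark": the tag simply matches several times.
        NodeList items = inner.getElementsByTagName("li");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            result.add(items.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: well-formed variant of the question's CDATA content (</ol>, not </eol>).
        String cdataText = "<ol><li> Data Loaded:  NO</li><li>Quality: 5</li>"
                + "<li>Status: UP</li><li>Index: 72</li></ol>";

        // Prints one field per line.
        for (String item : listItems(cdataText)) {
            System.out.println(item);
        }
    }
}
```

In the question's code, the argument to listItems would be the string returned by getCharacterDataFromElement(line).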

Thanks, but I've been going through most of StackOverFlow's questions but am unsure how to parse through a text with the SAME XML tag being the li elements. The closest that I found were these two: [link] http://stackoverflow.com/questions/12889253/how-to-parse-same-name-tag-in-xml-using-dom-parser-java [link] http://stackoverflow.com/questions/18391388/parsing-xml-with-tags-with-same-name-on-different-levels-in-dom — stitch70, Aug 13 '14 at 18:39
When You were looking for `placemark`, you wrote a "for" loop to find "all elements by (that) tag name". Well, You would do the same if you are looking for `li`. The difference being that placemark was matched only once, but li would be matched several. — GPI, Aug 14 '14 at 01:38

score 0 · Answer 2 · answered Aug 13 '14 at 06:59

Your question is something of a contradiction, since CDATA is an explicit instruction to the parser NOT to parse what it sees inside the CDATA. So the simplest way to get the content parsed is not to include the CDATA tags in the first place.

However, having told the parser not to parse the CDATA content, what you can do is extract the content as text, and then submit the text to the parser as a second parse operation.
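
As a rough end-to-end sketch of that two-pass idea (not the asker's final solution, which uses jsoup for the second pass): the first pass pulls the description text out of the outer document, and the second pass parses that text and splits each li entry at its first colon. TwoPassCdataParse and descriptionFields are hypothetical names, the sample input is a cut-down Placemark, and the CDATA content is assumed to be well-formed (closing </ol> rather than the </eol> in the question's file).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TwoPassCdataParse {

    // Parses an XML string with the given builder.
    static Document parse(DocumentBuilder builder, String xml) throws Exception {
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Hypothetical helper: returns the fields of the first <description>
    // as an ordered map from field name to value.
    static Map<String, String> descriptionFields(String outerXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // First pass: the CDATA content is extracted as plain text.
        Document outer = parse(builder, outerXml);
        Element description = (Element) outer.getElementsByTagName("description").item(0);
        String cdataText = description.getTextContent();

        // Second pass: the extracted text is parsed as a document of its own.
        Document inner = parse(builder, cdataText);
        NodeList items = inner.getElementsByTagName("li");

        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < items.getLength(); i++) {
            String text = items.item(i).getTextContent();
            int colon = text.indexOf(':');
            if (colon < 0) {
                continue; // entries without "name: value" shape are skipped
            }
            fields.put(text.substring(0, colon).trim(), text.substring(colon + 1).trim());
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String outerXml = "<Placemark><name>P07L327</name><description><![CDATA["
                + "<ol><li> Data Loaded:  NO</li><li>Quality: 5</li>"
                + "<li>Status: UP</li><li>Index: 72</li></ol>]]></description></Placemark>";

        // One line per field, name and value separated.
        for (Map.Entry<String, String> field : descriptionFields(outerXml).entrySet()) {
            System.out.println(field.getKey() + " = " + field.getValue());
        }
    }
}
```

A LinkedHashMap is used so the fields print in the order they appear in the document.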

Michael Kay


Thanks, but how do I parse through a text with the same XML tag, being the li elements? The closest I found was this: [link] http://stackoverflow.com/questions/12889253/how-to-parse-same-name-tag-in-xml-using-dom-parser-java – stitch70 Aug 13 '14 at 18:46
I'm sorry, I don't understand your question. – Michael Kay Aug 13 '14 at 19:16
After extracting the CDATA contents as text as mentioned, how would I extract XML tags that have the same tag name (being <li>)? The example is on line 5 of the first code snippet above - thks – stitch70 Aug 13 '14 at 19:53
