Get only text from an XML document in Java?

Question

Good morning. How can I get only the text from an XML element (description), for example this one?

<description><![CDATA[<b>
<font color="#000000">hello world...</font>
</b>]]></description>

My current code is:

if (cureent.getNodeName().equalsIgnoreCase("description")) {
    item.setDescription(cureent.getTextContent());
}

and the printed result is:

<![CDATA[<b><font color="#000000">hello world...</font></b>]]>

This is the output I need:

hello world...

Thanks All

Patrick Parker · Answer 1 · 2017-02-07T13:27:56.363

0

There may be a parser you could use for that, but I think a simple regex should get the job done:

String textContent = cureent.getTextContent();
String stripped = textContent.replaceAll("^<!\\[CDATA\\[|\\]\\]>$|<[^>]*>","");
item.setDescription(stripped);

Here is a breakdown of the pattern used above:

            "^<!\\[CDATA\\[" // find "<![CDATA[" at beginning
            +"|"             // or 
            +"\\]\\]>$"      // find "]]>" at ending
            +"|"             // or
            +"<[^>]*>"      // every tag from "<" up to ">"

Of course, as a commenter reminds us, this simple regex will fail if a ">" appears somewhere that does not actually close a tag, for example inside an attribute value. If that kind of data is a possibility, it is better to use a real parser, e.g. Jsoup.
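
As a quick check of the pattern and of the caveat above, here is a minimal, self-contained sketch. The class name StripMarkupDemo, the helper stripMarkup, and the second input string are made up for illustration; only the regex itself comes from the answer.

```java
import java.util.regex.Pattern;

public class StripMarkupDemo {
    // Same pattern as in the answer, compiled once for reuse.
    private static final Pattern MARKUP =
            Pattern.compile("^<!\\[CDATA\\[|\\]\\]>$|<[^>]*>");

    // Hypothetical helper: removes the CDATA wrapper and anything that looks like a tag.
    static String stripMarkup(String s) {
        return MARKUP.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<![CDATA[<b><font color=\"#000000\">hello world...</font></b>]]>";
        System.out.println(stripMarkup(input)); // prints: hello world...

        // A ">" inside an attribute value ends the match too early,
        // so part of the attribute is left in the output.
        String tricky = "<a title=\"a > b\">link</a>";
        System.out.println(stripMarkup(tricky));
    }
}
```

If the text spans several lines, as in the question's original XML, the line breaks between the tags are kept, so a trim() on the result may be needed as well.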

edited Feb 07 '17 at 13:27

answered Feb 07 '17 at 11:58

Patrick Parker

@Mark [Here](http://stackoverflow.com/help/someone-answers) is what to do when someone answers your question. To mark an answer as accepted, click on the check mark beside the answer to toggle it from greyed out to filled in. There is no need to add a comment on your question or on an answer to say "Thank you". – Patrick Parker Feb 07 '17 at 12:51

Please, read http://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex – Absolut Feb 07 '17 at 12:52

score 0 · Answer 2 · answered Feb 07 '17 at 12:18

If your input file is not well-formed XML, the DocumentBuilder class cannot parse it, so we need to work around that by processing it as a plain text file. Here is what I tried:

    // Requires: java.io.BufferedReader, java.io.FileReader, java.io.IOException
    StringBuilder totalString = new StringBuilder();

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(
            new FileReader("D:\\workspace\\Test\\Trial.xml"))) { // Put your file path here

        String sCurrentLine;
        while ((sCurrentLine = br.readLine()) != null) {
            totalString.append(sCurrentLine);
        }

        // Cut from the opening <font color= tag up to the closing </font> tag
        String condensedString = totalString.substring(totalString.indexOf("<font color="),
                totalString.indexOf("</font>"));

        // Drop the rest of the opening tag, i.e. everything up to and including its '>'
        System.out.println(condensedString.substring(condensedString.indexOf('>') + 1));
    } catch (IOException e) {
        e.printStackTrace();
    }

First, I condense the string by cutting it from the <font color= tag up to the </font> tag.

Then I cut it again just after the first '>', which drops the rest of the opening tag and leaves only the text. Digits and "#" characters inside the text itself are left untouched.
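
The cutting approach described above can also be tried on an in-memory string, without a file. This is only a sketch: the class name FontTextDemo and the helper textInsideFont are hypothetical, and it assumes the text of interest sits inside the first <font ...> element.

```java
public class FontTextDemo {
    // Hypothetical helper: returns the text between the first <font ...> tag
    // and the following </font>, or null if either marker is missing.
    static String textInsideFont(String s) {
        int open = s.indexOf("<font");
        if (open < 0) {
            return null;
        }
        int openEnd = s.indexOf('>', open); // end of the opening tag
        if (openEnd < 0) {
            return null;
        }
        int close = s.indexOf("</font>", openEnd);
        if (close < 0) {
            return null;
        }
        return s.substring(openEnd + 1, close);
    }

    public static void main(String[] args) {
        String xml = "<description><![CDATA[<b>"
                + "<font color=\"#000000\">hello world...</font>"
                + "</b>]]></description>";
        System.out.println(textInsideFont(xml)); // prints: hello world...
    }
}
```

Checking each index before calling substring avoids a StringIndexOutOfBoundsException when the input has no font element.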

Hope it helps!

score 0 · Answer 3 · answered Feb 07 '17 at 12:27

Try this:

if (cureent.getNodeName().equalsIgnoreCase("description")) {
    // Lazily match each tag from "<" to the nearest ">" and remove it
    item.setDescription(cureent.getTextContent().replaceAll("<.*?>", ""));
}

answered Feb 07 '17 at 12:27

maksoud

score 0 · Answer 4 · answered Feb 07 '17 at 13:01

I came up with a solution using Jsoup, and it works for your example input. Testing it with a wide range of inputs is recommended, though.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static void main(String[] args) throws Exception {
    String xml = "<description><![CDATA[<b>\r\n" + 
            "<font color=\"#000000\">hello world...</font>\r\n" + 
            "</b>]]></description>";
    Document d = Jsoup.parse(xml);
    // Take the text of the first <description> element and strip any markup left in it
    String text = extractText(d.getElementsByTag("description").get(0).text());
    System.out.println(text);
}

// Re-parses the text until parsing no longer changes it, i.e. no markup is left
private static String extractText(String xml) {
    Document d = Jsoup.parse(xml);
    if (!xml.equals(d.text())) {
        return extractText(d.text());
    }
    return d.text();
}
