0

Good morning. How can I get only the text from an XML item, for example this description element?

<description><![CDATA[<b>
<font color="#000000">hello world...</font>
</b>]]></description>

My code at the moment is:

if (cureent.getNodeName().equalsIgnoreCase("description")) {
    item.setDescription(cureent.getTextContent());
}

and the printed result is:

<![CDATA[<b><font color="#000000">hello world...</font></b>]]>

This is the output I need:

hello world...

Thanks All

Mark

4 Answers

0

There may be a parser you could use for that, but I think a simple regex should get the job done:

String textContent = cureent.getTextContent();
String stripped = textContent.replaceAll("^<!\\[CDATA\\[|\\]\\]>$|<[^>]*>","");
item.setDescription(stripped);

Here is a breakdown of the pattern used above:

            "^<!\\[CDATA\\[" // find "<![CDATA[" at beginning
            +"|"             // or 
            +"\\]\\]>$"      // find "]]>" at ending
            +"|"             // or
            +"<[^>]*>"      // every tag from "<" up to ">" 

Of course, as a commenter reminds us, this simple regex will fail if a ">" appears somewhere that does not actually close a tag, for example inside an attribute value. If that type of data is a possibility, it is better to use a real parser, e.g. Jsoup.
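
To try the pattern in isolation, here is a minimal self-contained sketch. The class name `StripCdataDemo` and the method `strip` are made up for illustration, and the `trim()` call is an addition, since the sample has line breaks around the font element. The second call shows the kind of input where a stray ">" ends a match too early.

```java
class StripCdataDemo {
    // Removes a leading "<![CDATA[", a trailing "]]>", and anything that looks like a tag.
    static String strip(String textContent) {
        return textContent
                .replaceAll("^<!\\[CDATA\\[|\\]\\]>$|<[^>]*>", "")
                .trim(); // assumption: surrounding whitespace is not wanted
    }

    public static void main(String[] args) {
        String sample = "<![CDATA[<b>\n<font color=\"#000000\">hello world...</font>\n</b>]]>";
        System.out.println(strip(sample)); // hello world...

        // A '>' inside an attribute value is taken as the end of the tag.
        String tricky = "<a title=\"a>b\">x</a>";
        System.out.println(strip(tricky)); // b">x
    }
}
```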

Patrick Parker
  • @Mark [Here](http://stackoverflow.com/help/someone-answers) is what to do when someone answers your question. To mark an answer as accepted, click on the check mark beside the answer to toggle it from greyed out to filled in. There is no need to add a comment on your question or on an answer to say "Thank you". – Patrick Parker Feb 07 '17 at 12:51
  • Please read http://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex – Absolut Feb 07 '17 at 12:52
0

Since your input file is not well-formed XML, we cannot use the DocumentBuilder class to parse it as XML. Instead, we need to hack it by processing it as a plain text file. Here's what I have tried:

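    // Requires: java.io.BufferedReader, java.io.FileReader, java.io.IOException
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(
            new FileReader("D:\\workspace\\Test\\Trial.xml"))) { // Put your file path here

        String sCurrentLine;
        StringBuilder totalString = new StringBuilder();

        // Join all lines of the file into a single string
        while ((sCurrentLine = br.readLine()) != null) {
            totalString.append(sCurrentLine);
        }

        // Keep everything from the opening <font color= up to (not including) </font>
        String condensedString = totalString.substring(totalString.indexOf("<font color="),
                totalString.indexOf("</font>"));

        // Drop digits and '#' (note: this also removes any digits in the text itself)
        String moreCondensedString = condensedString.replaceAll("[0-9]", "").replaceAll("#", "");
        // Print whatever follows the first '>'
        System.out.println(moreCondensedString.substring(moreCondensedString.indexOf('>') + 1));
    } catch (IOException e) {
        e.printStackTrace();
    }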

Here I first condensed your string by cutting it down to the part between the <font color= tag and the </font> tag.

Then I removed all digits and the # character.

Then I condensed the string again by taking everything after the '>'.

Hope it helps!
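
The same cut can also be made without deleting digits from the text: find the ">" that ends the opening font tag and take everything up to </font>. This is only a minimal sketch, assuming the markup is already in a string and contains a single font element; `FontTextDemo` and `textInsideFont` are hypothetical names.

```java
class FontTextDemo {
    // Returns the text between the opening <font ...> tag and </font>,
    // or an empty string if either tag is missing.
    static String textInsideFont(String s) {
        int open = s.indexOf("<font");
        if (open < 0) return "";
        int start = s.indexOf('>', open);      // end of the opening tag
        if (start < 0) return "";
        int end = s.indexOf("</font>", start); // start of the closing tag
        if (end < 0) return "";
        return s.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String line = "<description><![CDATA[<b><font color=\"#000000\">"
                + "hello world 42...</font></b>]]></description>";
        System.out.println(textInsideFont(line)); // hello world 42...
    }
}
```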

0

Try this:

if (cureent.getNodeName().equalsIgnoreCase("description")) {
    // Lazily match each "<...>" and remove it
    item.setDescription(cureent.getTextContent().replaceAll("<.*?>", ""));
}
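
For comparison, here is a minimal sketch of this approach with the JDK's own DOM parser, assuming the element is read from a well-formed document. In that case `getTextContent()` returns the CDATA contents without the `<![CDATA[` and `]]>` markers, so stripping the tags is enough. `DescriptionTextDemo` and `descriptionText` are hypothetical names, and `trim()` is added to drop the line breaks in the sample.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DescriptionTextDemo {
    static String descriptionText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The CDATA contents come back with their markup but without the markers
            String raw = doc.getDocumentElement().getTextContent();
            return raw.replaceAll("<.*?>", "").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<description><![CDATA[<b>\n"
                + "<font color=\"#000000\">hello world...</font>\n"
                + "</b>]]></description>";
        System.out.println(descriptionText(xml)); // hello world...
    }
}
```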
maksoud
0

I came up with a solution using Jsoup, and it works for your example input. Testing with a wide range of inputs is recommended, though.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static void main(String[] args) throws Exception {
    String xml = "<description><![CDATA[<b>\r\n" + 
            "<font color=\"#000000\">hello world...</font>\r\n" + 
            "</b>]]></description>";
    Document d = Jsoup.parse(xml);
    // Text of the first <description> element; it may still contain markup
    String text = extractText(d.getElementsByTag("description").get(0).text());
    System.out.println(text);
}

// Re-parses the text until parsing no longer changes it, i.e. no markup is left
private static String extractText(String xml) {
    String text = Jsoup.parse(xml).text();
    if (!xml.equals(text)) {
        return extractText(text);
    }
    return text;
}
Pavan Kumar