I am trying to parse an XML file that contains HTML strings using DOMParser. The problem is that getTextContent() returns only the text, without any of the HTML tags inside it. I expect the string to be returned as is rather than the parsed version. I have searched extensively and could not find anything that helps. I cannot make any changes to the HTML strings, since there are more than 100k strings spread across around 500 files.

Test.xml file

<?xml version="1.0" encoding="iso-8859-1"?>
<UserDetails xml:lang="en">
    <UserMessage ID="TestID">Text goes here. <span style="color:#DF0000"><b>Bold Text goes here.</b> </span>More Text.</UserMessage>
</UserDetails>

Java module

import com.sun.org.apache.xerces.internal.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

public class TestAll
{
    public static void main(String[] args)
    {
        try
        {
            File file = new File("C:/Users/Administrator/Desktop/Test.xml");

            DOMParser fileParser = new DOMParser();
            InputStream in = new FileInputStream(file);
            InputSource source = new InputSource(in);
            fileParser.parse(source);
            in.close();
            Document newFileDoc = fileParser.getDocument();
            NodeList nodes = newFileDoc.getChildNodes();
            for (int i = 0; i < nodes.getLength(); i++)
            {
                Node node = nodes.item(i);
                NodeList userMessages = node.getChildNodes();
                for (int j = 0; j < userMessages.getLength(); j++)
                {
                    Node userMessage = userMessages.item(j);
                    if (userMessage.getNodeType() == Node.ELEMENT_NODE)
                    {
                        String text = userMessage.getTextContent();
                        System.out.println(text);
                    }
                }
            }
        }
        catch (Exception e)
        {
            e.printStackTrace(); 
        }
    }

}

Actual Output

Text goes here. Bold Text goes here. More Text.

Expected Output

Text goes here. <span style="color:#DF0000"><b>Bold Text goes here.</b> </span>More Text.

Any help would be appreciated.

user864309

2 Answers

Try putting the text between

<xmp> </xmp>

tags; everything in between will be displayed as is.

DuckStalker

Your userMessage variable is a DOM node.

If you want to convert the DOM node to an HTML string, look here:

How do I convert a org.w3c.dom.Document object to a String?
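
As a rough illustration of that idea, here is a minimal sketch that serializes each child of the UserMessage node back to markup with the standard javax.xml.transform API, instead of calling getTextContent(). The class name InnerXmlSketch and the helpers innerXml and firstUserMessageMarkup are hypothetical names made up for this example, and the XML is inlined from Test.xml in the question (without the encoding declaration) so the code runs on its own. It uses DocumentBuilderFactory rather than the internal DOMParser class, which is an assumption about what is acceptable in your setup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class InnerXmlSketch
{
    // Sample content copied from Test.xml in the question, inlined for a self-contained run.
    static final String SAMPLE =
        "<UserDetails xml:lang=\"en\">"
        + "<UserMessage ID=\"TestID\">Text goes here. <span style=\"color:#DF0000\">"
        + "<b>Bold Text goes here.</b> </span>More Text.</UserMessage>"
        + "</UserDetails>";

    // Hypothetical helper: writes every child node of 'parent' (text and elements)
    // back out as markup, so nested tags are kept instead of being flattened to text.
    static String innerXml(Node parent) throws Exception
    {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Force XML output and suppress the <?xml ...?> header for each fragment.
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "no");

        StringWriter out = new StringWriter();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++)
        {
            transformer.transform(new DOMSource(children.item(i)), new StreamResult(out));
        }
        return out.toString();
    }

    // Hypothetical helper: parses the XML string and returns the markup
    // inside the first UserMessage element.
    static String firstUserMessageMarkup(String xml) throws Exception
    {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        Node userMessage = doc.getElementsByTagName("UserMessage").item(0);
        return innerXml(userMessage);
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(firstUserMessageMarkup(SAMPLE));
    }
}
```

In the loop from the question, this would mean calling innerXml(userMessage) where getTextContent() is called now. Note that the result is re-serialized from the DOM rather than copied from the file, so details such as attribute quoting, entity escaping, or the form of empty tags may differ from the original source text.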

g00dnatur3