I have the following SOAP XML from which I want to extract the text content of all nodes:
<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"
xmlns:m="http://www.example.org/stock">
<soap:Body>
<m:GetStockName>
<m:StockName>ABC</m:StockName>
</m:GetStockName>
<!--some comment-->
<m:GetStockPrice>
<m:StockPrice>10 \n </m:StockPrice>
<m:StockPrice>\t20</m:StockPrice>
</m:GetStockPrice>
</soap:Body>
</soap:Envelope>
The expected output would be:
'ABC10 \n \t20'
I've done the following in DOM:
public static String parseXmlDom() throws ParserConfigurationException,
        SAXException, IOException, FileNotFoundException {
    // Configure the factory before creating the builder; settings applied
    // afterwards do not affect a DocumentBuilder that already exists
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setIgnoringComments(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Read XML File
    String xml = IOUtils.toString(new FileInputStream(new File(
            "./files/request2.xml")), "UTF-8");
    InputSource is = new InputSource(new StringReader(xml));
    // Parse XML String to DOM
    Document doc = builder.parse(is);
    // Extract nodes text (the first match is the root element)
    NodeList nodeList = doc.getElementsByTagNameNS("*", "*");
    Node node = nodeList.item(0);
    return node.getTextContent();
}
And with SAX:
public static String parseXmlSax() throws SAXException, IOException, ParserConfigurationException {
    final StringBuffer sb = new StringBuffer();
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    // Declare Handler
    DefaultHandler handler = new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            sb.append(new String(ch, start, length));
        }
    };
    // Parse XML
    saxParser.parse("./files/request2.xml", handler);
    return sb.toString();
}
With both approaches I receive:
'
ABC
10 \n
\t20
'
I know I could easily use return sb.toString().replaceAll("\n", "").replaceAll("\t", ""); to achieve the expected result, but if my XML file is badly formatted, with extra spaces for example, the result would include those extra spaces too.
Also, I've tried this approach to read the XML as a single line before parsing it with SAX or DOM, but it does not work for my SOAP XML example because it removes the whitespace between the soap:Envelope attributes wherever they are separated by a line break (here, before xmlns:m):
<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"xmlns:m="http://www.example.org/stock"><soap:Body><m:GetStockName><m:StockName>ABC</m:StockName></m:GetStockName><m:GetStockPrice><m:StockPrice>10 \n </m:StockPrice><m:StockPrice>\t20</m:StockPrice></m:GetStockPrice></soap:Body></soap:Envelope>
[Fatal Error] :1:129: Element type "soap:Envelope" must be followed by either attribute specifications, ">" or "/>".
How can I read just the text content of all nodes in a SOAP XML document, regardless of whether the file consists of a single line or of multiple lines that are well or badly formatted (and ignoring comments too)?
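One direction that might work, shown here only as a minimal sketch and not a definitive solution: with SAX, buffer the characters of each element and keep them only when the element closes without having contained a child element. That way the whitespace used for indentation between tags is dropped, while whitespace inside a leaf element such as m:StockPrice is kept. The class name LeafTextExtractor and the method leafText are hypothetical names introduced for illustration, and the sketch assumes that text is only wanted from elements with no child elements (text mixed in between child elements is discarded).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of the code above.
final class LeafTextExtractor {

    /** Concatenates the text of every element that has no child elements. */
    static String leafText(String xml) {
        final StringBuilder result = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // Text seen since the most recent start or end tag
            private final StringBuilder current = new StringBuilder();
            // True once the innermost open element has had a child element
            private boolean sawChild;

            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) {
                current.setLength(0); // drop indentation before this tag
                sawChild = false;     // the new element has no children yet
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                current.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!sawChild) {
                    result.append(current); // leaf element: keep text verbatim
                }
                current.setLength(0);
                sawChild = true; // the parent now has at least one child
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return result.toString();
    }
}
```

Calling LeafTextExtractor.leafText(xml) with the contents of the file as a string should behave the same whether the XML is on one line or spread over several. Comments are not delivered to a plain DefaultHandler (they are only reported through a LexicalHandler), so they do not end up in the result.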