I have the following SOAP XML from which I want to extract the text content of all nodes:

<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"
    xmlns:m="http://www.example.org/stock">
    <soap:Body>
        <m:GetStockName>
            <m:StockName>ABC</m:StockName>
        </m:GetStockName>
        <!--some comment-->
        <m:GetStockPrice>
            <m:StockPrice>10 \n </m:StockPrice>
            <m:StockPrice>\t20</m:StockPrice>
        </m:GetStockPrice>
    </soap:Body>
</soap:Envelope>

The expected output would be:

'ABC10 \n \t20'

I've done the following in DOM:

public static String parseXmlDom() throws ParserConfigurationException,
        SAXException, IOException, FileNotFoundException {

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    // Configure the factory before creating the builder;
    // settings applied afterwards do not affect an existing builder
    factory.setNamespaceAware(true);
    factory.setIgnoringComments(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Read XML File
    String xml = IOUtils.toString(new FileInputStream(new File(
            "./files/request2.xml")), "UTF-8");
    InputSource is = new InputSource(new StringReader(xml));
    // Parse XML String to DOM
    Document doc = builder.parse(is);
    // Extract nodes text
    NodeList nodeList = doc.getElementsByTagNameNS("*", "*");
    Node node = nodeList.item(0);
    return node.getTextContent();
}

And with SAX:

public static String parseXmlSax() throws SAXException, IOException, ParserConfigurationException {

    final StringBuilder sb = new StringBuilder();
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    // Declare Handler
    DefaultHandler handler = new DefaultHandler() {
        // Called for every run of character data, including
        // the whitespace used to indent the elements
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            sb.append(ch, start, length);
        }
    };
    // Parse XML
    saxParser.parse("./files/request2.xml", handler);
    return sb.toString();
}

With both approaches I receive:

'


            ABC



            10 \n 
            \t20


'

I know I could simply return sb.toString().replaceAll("\n", "").replaceAll("\t", "") to get the expected result, but if my XML file is badly formatted, with extra spaces for example, the result would include those extra spaces too.

I've also tried this approach of reading the XML as a single line before parsing it with SAX or DOM, but it does not work for my SOAP XML example: where an attribute of soap:Envelope sits on a new line (xmlns:m), removing the line break also removes the only whitespace separating it from the previous attribute:

<soap:Envelope xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"xmlns:m="http://www.example.org/stock"><soap:Body><m:GetStockName><m:StockName>ABC</m:StockName></m:GetStockName><m:GetStockPrice><m:StockPrice>10 \n  </m:StockPrice><m:StockPrice>\t20</m:StockPrice></m:GetStockPrice></soap:Body></soap:Envelope>
[Fatal Error] :1:129: Element type "soap:Envelope" must be followed by either attribute specifications, ">" or "/>".

How can I read just the text content of all nodes in a SOAP XML document, regardless of whether the file consists of a single line or of multiple well or badly formatted lines (and ignoring comments too)?
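
To make the goal concrete, here is a minimal sketch of one possible direction, not a confirmed solution. It assumes that "formatting" whitespace always shows up as text nodes containing nothing but whitespace, so it selects only text nodes with some non-whitespace content via the XPath expression //text()[normalize-space()] and concatenates them unchanged. The class name TextExtractor and the method extractText are hypothetical names chosen for illustration. Comments are never selected because they are not text nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for illustration only.
public class TextExtractor {

    // Concatenates every text node that contains at least one
    // non-whitespace character; whitespace-only nodes are skipped.
    public static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Configure the factory before creating the builder
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // normalize-space() yields an empty string for whitespace-only
            // text nodes, so the predicate filters them out
            NodeList texts = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//text()[normalize-space()]", doc, XPathConstants.NODESET);

            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                // Append the node value untouched, keeping inner spaces
                sb.append(texts.item(i).getNodeValue());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract text from XML", e);
        }
    }
}
```

A known limitation of this sketch: an element whose entire content is whitespace would be dropped as well, so it only fits if such content is never meaningful in the input.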
