Using Java Pattern and Matcher, how to get first matching tag content

Question

I have a SOAP message that looks like this:

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Header>
    <Action xmlns="http://www.w3.org/2005/08/addressing">http://service.xxx.dk/DialogModtag</Action>
    <MessageID xmlns="http://www.w3.org/2005/08/addressing">urn:uuid:382b4943-26e8-4698-a275-c3149d2d889e</MessageID>
    <To xmlns="http://www.w3.org/2005/08/addressing">http://xxx.dk/12345678</To>
    <RelatesTo xmlns="http://www.w3.org/2005/08/addressing">uuid:cb2320dc-c8ab-4880-94cb-2ab68129216f</RelatesTo>
</soap:Header>
<soap:Body xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="id-2515">
    Some content ...
</soap:Body>
</soap:Envelope>

and I am trying to extract the contents of the <Action> tag within the <Header> tag using code like this:

Pattern PATTERN_SOAP_ACTION = 
    Pattern.compile(".*Header.*Action.*>(.*)<.*Action.*Header.*", Pattern.DOTALL);

String text = readFile("c:\\temp\\DialogUdenBilag.xml");
Matcher matcherSoapAction = PATTERN_SOAP_ACTION.matcher(text);
if (matcherSoapAction.matches()) { System.out.println(matcherSoapAction.group(1)); }
else { System.out.println("SoapAction not found"); }

This seems to work fine for small SOAP messages, but when the soap:Body grows beyond 1 MB, the matches() call takes minutes to complete.

Any ideas for making my regex pattern more CPU friendly?

Possible answer: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Andrey Chaschev, Nov 25 '13 at 17:17

Stephan · Answer 1 · 2013-11-25T19:59:54.217

Solution

For a more CPU-friendly solution, use an XML parser, for example the streaming StAX API:

 import java.io.FileInputStream;
 import javax.xml.stream.XMLInputFactory;
 import javax.xml.stream.XMLStreamConstants;
 import javax.xml.stream.XMLStreamReader;

 XMLInputFactory factory = XMLInputFactory.newInstance();
 XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream("c:\\temp\\DialogUdenBilag.xml"));

 boolean found = false;
 boolean inHeader = false;
 String actionContent = "";

 // Pull parser events one at a time and stop as soon as the Action text is found
 while (!found && reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        String localName = reader.getLocalName();

        // Remember that the Header element has been entered
        if ("Header".equalsIgnoreCase(localName)) {
            inHeader = true;
        }

        if (inHeader && "Action".equalsIgnoreCase(localName)) {

            // Read events until the first text node or the end of the Action element
            int evt = reader.next();
            do {
               if (evt == XMLStreamConstants.CHARACTERS) {
                   actionContent = reader.getText().trim();
                   found = true;
                   break;
               }

               evt = reader.next();
            } while (evt != XMLStreamConstants.END_ELEMENT);

        }
    }
 }

 if (found) {
     System.out.println(actionContent);
 } else {
     System.out.println("SoapAction not found");
 }

Discussion

This snippet is a bit lengthy, but it gets your answer without reading the whole XML document: it stops as soon as it finds the Action element inside the header and returns the text content of that element.
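As a rough sketch of the same early-exit idea, the variant below checks namespace URIs instead of local names only, and gives up once the header closes so the body is never read. The class name SoapActionStax, the method findAction, and the choice to wrap XMLStreamException in an unchecked exception are assumptions made for illustration; they are not part of the original answer.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, named here for illustration only.
class SoapActionStax {
    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";
    static final String WSA_NS = "http://www.w3.org/2005/08/addressing";

    /** Returns the text of the first Action inside the SOAP header, or null if there is none. */
    static String findAction(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                boolean inHeader = false;
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        if (isElement(reader, SOAP_NS, "Header")) {
                            inHeader = true;
                        } else if (inHeader && isElement(reader, WSA_NS, "Action")) {
                            // getElementText() reads up to the matching end tag
                            return reader.getElementText().trim();
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && isElement(reader, SOAP_NS, "Header")) {
                        return null; // header closed without an Action: skip the body
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Assumption: callers prefer an unchecked exception for malformed input
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static boolean isElement(XMLStreamReader r, String ns, String local) {
        return ns.equals(r.getNamespaceURI()) && local.equals(r.getLocalName());
    }

    public static void main(String[] args) {
        String xml = "<soap:Envelope xmlns:soap='" + SOAP_NS + "'>"
            + "<soap:Header><Action xmlns='" + WSA_NS + "'>"
            + "http://service.xxx.dk/DialogModtag</Action></soap:Header>"
            + "<soap:Body>Some content ...</soap:Body></soap:Envelope>";
        System.out.println(findAction(xml));
    }
}
```

For a file on disk, the StringReader could be swapped for a FileInputStream as in the snippet above; the parsing loop stays the same.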

You are absolutely right! I want to use your solution, which even also works much quicker for small soap actions. Thank you very much for your help. — user3033204, Nov 26 '13 at 07:11

score 1 · Answer 2 · edited May 23 '17 at 11:56

Using regular expressions to parse XML is evil, and may incur the Wrath of the One whose Name cannot be expressed in the Basic Multilingual Plane. If you need to parse XML, use an actual XML parser - that's what it's there for. And situations like this are what XPath expressions are for, too:

javax.xml.xpath.XPath xpath = javax.xml.xpath.XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(new NamespaceContextMap(
    "s", "http://schemas.xmlsoap.org/soap/envelope/",
    "a", "http://www.w3.org/2005/08/addressing"));
javax.xml.xpath.XPathExpression expression = xpath.compile("//s:Header/a:Action");
String result = expression.evaluate(new org.xml.sax.InputSource(new FileReader("c:\\temp\\DialogUdenBilag.xml")));

(Note that NamespaceContextMap isn't a standard class; it is a small helper that implements javax.xml.namespace.NamespaceContext.)
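If that helper class is not at hand, a small anonymous NamespaceContext is enough for XPath evaluation. The following is a minimal sketch under that assumption; the class name SoapActionXPath and the methods namespaces and findAction are made up for this example and are not the NamespaceContextMap implementation mentioned above. It also uses an absolute path instead of // so the expression does not search the whole tree.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, named here for illustration only.
class SoapActionXPath {

    /** Minimal prefix-to-URI lookup; only the direction XPath needs is implemented. */
    static NamespaceContext namespaces(final Map<String, String> prefixToUri) {
        return new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                String uri = prefixToUri.get(prefix);
                return uri != null ? uri : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                throw new UnsupportedOperationException();
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                throw new UnsupportedOperationException();
            }
        };
    }

    /** Returns the Action text, or an empty string when no such element exists. */
    static String findAction(String xml) {
        Map<String, String> ns = new HashMap<String, String>();
        ns.put("s", "http://schemas.xmlsoap.org/soap/envelope/");
        ns.put("a", "http://www.w3.org/2005/08/addressing");

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(namespaces(ns));
        try {
            return xpath.evaluate("/s:Envelope/s:Header/a:Action",
                                  new InputSource(new StringReader(xml))).trim();
        } catch (XPathExpressionException e) {
            // Assumption: callers prefer an unchecked exception here
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<soap:Envelope xmlns:soap='http://schemas.xmlsoap.org/soap/envelope/'>"
            + "<soap:Header><Action xmlns='http://www.w3.org/2005/08/addressing'>"
            + "http://service.xxx.dk/DialogModtag</Action></soap:Header>"
            + "<soap:Body>Some content ...</soap:Body></soap:Envelope>";
        System.out.println(findAction(xml));
    }
}
```

Keep in mind that evaluating XPath against an InputSource parses the whole document into a tree first, so for bodies of 1 MB or more a streaming parser is likely to use less memory and time.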

As for your regexp: it is written to match the entire input string unnecessarily, and it does a lot of greedy (maximal) rather than reluctant (minimal) matching, so every .* first runs to the end of the document and then backtracks. You'd chew through a lot less CPU if you had an expression more tightly focused on just the relevant bit of the document (for example, "<((?:\\w+:)?)Header\\b[^>]*>.*?<((?:\\w+:)?)Action\\b[^>]*>(.*?)</\\2Action>.*?</\\1Header>", compiled with Pattern.DOTALL as in your code, with the Action text in group 3), and called Matcher.find() to do a substring match. That said, parsing XML with a regexp is bad practice - you really should be using an XML parser instead!
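For completeness, here is a minimal sketch of the focused pattern used with Matcher.find(). The class name SoapActionRegex and the method extract are hypothetical, and the sketch assumes the whole message is already in a String. It is shown only to illustrate find() with reluctant quantifiers, not as a recommendation over a parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here for illustration only.
class SoapActionRegex {
    // Group 1: optional prefix of Header, group 2: optional prefix of Action,
    // group 3: the Action text. DOTALL lets ".*?" cross line breaks.
    private static final Pattern ACTION_IN_HEADER = Pattern.compile(
        "<((?:\\w+:)?)Header\\b[^>]*>.*?<((?:\\w+:)?)Action\\b[^>]*>(.*?)</\\2Action>.*?</\\1Header>",
        Pattern.DOTALL);

    /** Returns the trimmed Action text, or null when the pattern does not match. */
    static String extract(String text) {
        Matcher m = ACTION_IN_HEADER.matcher(text);
        // find() looks for a matching substring instead of matching the whole input
        return m.find() ? m.group(3).trim() : null;
    }

    public static void main(String[] args) {
        String xml = "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">\n"
            + "<soap:Header>\n"
            + "    <Action xmlns=\"http://www.w3.org/2005/08/addressing\">"
            + "http://service.xxx.dk/DialogModtag</Action>\n"
            + "</soap:Header>\n"
            + "<soap:Body>Some content ...</soap:Body>\n"
            + "</soap:Envelope>";
        System.out.println(extract(xml));
    }
}
```

Because the match ends at the closing Header tag, a large soap:Body after the header is not scanned when the Action is present.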
