
Here is a simple function I wrote to get the value from a tag.

public static String getTagAValue(String xmlAsString) {
    Pattern pattern = Pattern.compile("<TagA>(.+)</TagA>");
    Matcher matcher = pattern.matcher(xmlAsString);
    if (matcher.find()) {
        return matcher.group(1);
    } else {
        return null;
    }
}

It does not find a match and returns null.

XML Sample

<xml>
    <sample>
        <TagA>result</TagA>
    </sample>
</xml>

Note: I used 4 spaces for indentation here, but the real string contains tabs.

GC_

2 Answers

3

Don't use regular expressions to parse XML: it's the wrong tool for the job.

Classic answer here: RegEx match open tags except XHTML self-contained tags

The answer you have accepted gives wrong answers, for example:

  • It doesn't accept whitespace in places where whitespace is allowed, such as before ">"

  • It will match a commented-out element, or one that appears in a CDATA section

  • It does a greedy match, so it will find the LAST matching end tag, not the first one.
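The failure modes listed above can be reproduced with a small sketch. The inputs are hypothetical, and the class name GreedyMatchDemo and the helper firstGroup are made up for illustration; only the pattern comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyMatchDemo {
    // Returns group 1 of the first match, or null if nothing matches.
    public static String firstGroup(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String pattern = "<TagA>(.+)</TagA>";

        // Greedy (.+) runs to the LAST </TagA> on the line.
        System.out.println(firstGroup(pattern, "<TagA>first</TagA><TagA>second</TagA>"));
        // prints: first</TagA><TagA>second

        // Whitespace before ">" is legal XML, but the pattern rejects it.
        System.out.println(firstGroup(pattern, "<TagA >x</TagA >"));
        // prints: null

        // A commented-out element is matched as if it were live.
        System.out.println(firstGroup(pattern, "<!-- <TagA>old</TagA> -->"));
        // prints: old
    }
}
```

Switching to a reluctant quantifier, (.+?), would address only the first of these cases; the other two remain.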

However hard you try, you will never get it 100% right.

And in case you care more about performance than correctness, it's also grossly inefficient because of the need for backtracking.

To do the job properly and professionally, use an XML parser.
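As one possible illustration, here is a minimal sketch that uses the DOM parser bundled with the JDK. It assumes the document is small enough to load into memory and that only the first TagA element is wanted; the class name TagAReader is made up, and the answer does not prescribe a specific parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class TagAReader {
    // Returns the text of the first TagA element, or null if there is none.
    public static String getTagAValue(String xmlAsString) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xmlAsString)));
            NodeList nodes = doc.getElementsByTagName("TagA");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed input is reported instead of silently returning null.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The sample from the question, with tabs and Windows line endings.
        String xml = "<xml>\r\n\t<sample>\r\n\t\t<TagA>result</TagA>\r\n\t</sample>\r\n</xml>";
        System.out.println(getTagAValue(xml)); // prints: result
    }
}
```

Because the parser understands comments, CDATA sections, and whitespace inside tags, the cases that trip up the regular expression are handled without extra code.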

Michael Kay
  • Thanks for your concern, but the XML format is controlled and regular. Honestly, I am not a fan of RegEx; Java needs a better search language. – GC_ Sep 11 '20 at 16:19
  • I just mention it because we get an awful lot of posts on StackOverflow from people saying "I need to generate XML in precisely this controlled format because the receiving application can't handle anything else". And at that point you're losing all the benefits of using a standardised format for data interchange. (I bet you're not even documenting precisely the constraints that the generating application has to be aware of). – Michael Kay Sep 12 '20 at 07:03
2

You probably want the regular expression to match across multiple lines:

Pattern.compile("<TagA>(.+)</TagA>", Pattern.DOTALL);

The documentation explains the Pattern.DOTALL flag:

Enables dotall mode. In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
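A small sketch of what the flag changes, using hypothetical inputs (the class name DotallDemo and the helper finds are made up for illustration). It only makes a difference when the text between the tags itself contains a line terminator.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // True if the regex, compiled with the given flags, matches somewhere in the input.
    public static boolean finds(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String pattern = "<TagA>(.+)</TagA>";

        // Tag content spans two lines: only DOTALL lets "." cross the line break.
        String multiLine = "<TagA>line one\r\nline two</TagA>";
        System.out.println(finds(pattern, 0, multiLine));              // prints: false
        System.out.println(finds(pattern, Pattern.DOTALL, multiLine)); // prints: true

        // Tag content on a single line: the flag is not needed.
        String singleLine = "<xml>\r\n\t<TagA>result</TagA>\r\n</xml>";
        System.out.println(finds(pattern, 0, singleLine));             // prints: true
    }
}
```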

Edit: While this works in this particular case, please refer to Michael Kay's answer if you want to solve such problems professionally, efficiently, and correctly.

jmizv
  • No luck, it didn't work... It's on Windows, if that matters, so the string has Windows line endings. – GC_ Sep 10 '20 at 18:01
  • The `DOTALL` flag won't have any effect, as your target is on one line. I tried this and got `true`: `Pattern compile = Pattern.compile("(.+)", Pattern.DOTALL); Matcher matcher = compile.matcher(" result\r\n "); boolean find = matcher.find();` – jmizv Sep 10 '20 at 18:04
  • Maybe it is because my string has more than one line. My result is not on the first line. – GC_ Sep 10 '20 at 18:06
  • This doesn't make any difference. Can you debug and check whether the input string `xmlAsString` is really what you think it is? – jmizv Sep 10 '20 at 18:07
  • Downvoting because this answer is wrong, for reasons explained in my answer. It might give the right answer on this one test case, but passing one test case doesn't make it correct. – Michael Kay Sep 11 '20 at 08:01