
I want to find XML tags of type x in a text that

  • are empty (contain only spaces)
  • may or may not have attributes

So something like this:

<x>  </x>
<x a="v">  </x>

I use the following regular expression in combination with the Matcher find function.

<x.*?> +</x>

I get matches that I don't expect. See the following test case:

@Test
public void sample() throws Exception
{
    String text = "Lorem <x>ipsum <x>dolor sit amet</x> </x>";
    String regex = "<x.*?> +</x>";

    Matcher matcher = Pattern.compile(regex).matcher(text);
    assertFalse(matcher.find());
}

The test fails. Instead, the following holds:

assertTrue(matcher.find());
assertEquals("<x>ipsum <x>dolor sit amet</x> </x>", matcher.group());

Does the find function not support the non-greedy operator, or what is going wrong here?

PS I know that there is a plethora of different ways to process xml data. But this is not the point here.

mkdev
  • 1
    [One good reason not to venture down this road](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Reimeus Aug 13 '13 at 14:02

1 Answer


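The .*? quantifier matches as few characters as possible while still letting the rest of the pattern match; it does not mean that it will stop at the first > it finds. The first > in your text is not followed by spaces and </x>, so the engine keeps extending .*? until the remainder of the pattern succeeds. So in your example, the <x.*?> will match all of: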

<x>ipsum <x>dolor sit amet</x>

With all the characters between the first x and the final > satisfying the .*?. To fix this, you can simply change your pattern to:

<x[^>]*> +</x>
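
As a rough illustration of the difference, the following sketch runs both patterns against the text from the question. The class name EmptyTagDemo and the helper firstMatch are made up for this example and are not part of the original code; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyTagDemo {
    // Hypothetical helper: returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Lorem <x>ipsum <x>dolor sit amet</x> </x>";

        // Lazy .*? may still run past several '>' characters until the
        // rest of the pattern (" +</x>") can match.
        System.out.println(firstMatch("<x.*?> +</x>", text));
        // prints: <x>ipsum <x>dolor sit amet</x> </x>

        // [^>]* cannot cross a '>', so the opening tag ends at its own '>'.
        System.out.println(firstMatch("<x[^>]*> +</x>", text));
        // prints: null

        // An empty tag with an attribute is still found.
        System.out.println(firstMatch("<x[^>]*> +</x>", "<x a=\"v\">  </x>"));
        // prints: <x a="v">  </x>
    }
}
```

One caveat with this sketch: <x[^>]*> also accepts any tag whose name merely starts with x (for example <xy>), so a stricter opening such as <x(\s[^>]*)?> may be preferable if such tags can occur in the input.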

On a side note, it's been stated many times before, but you should not use regular expressions to parse xml/html/xhtml.

p.s.w.g
  • Thanks for the answer and the hint. I know that regex is not the right tool for XML processing, but sometimes, if you have to make a little fix to thousands of files, a quick text replace is tempting. – mkdev Aug 13 '13 at 14:38