I've got a problem finding empty HTML elements in a multiline HTML file. My regexp is this:

Pattern pattern = Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)[^>]*?>[\\s]*?</\\1>");
Matcher matcher = pattern.matcher(htmlOut);
while (matcher.find())
{
    htmlOut = matcher.replaceAll("");
    matcher = pattern.matcher(htmlOut);
}

The problem is it doesn't match any of the empty tags.

FYI: The same regexp <([a-zA-Z][a-zA-Z0-9]*)[^>]*?>[\s]*?</\1> works in sublime text!

Any suggestions?

kernel
  • Obligatory Regex/Html reply: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – glenatron Jul 24 '12 at 11:07
  • @glenatron: Removing empty tags is well within the realm of what's possible with regex. Even with HTML. – Tim Pietzcker Jul 24 '12 at 11:26

1 Answer

The pattern is OK, but you're using it wrong. replaceAll() is called on the string, not on the matcher object.

Also, no need to iterate over the matches - one replaceAll is enough:

htmlOut = htmlOut.replaceAll("<([a-zA-Z][a-zA-Z0-9]*)[^>]*>\\s*</\\1>", "");

You don't need the lazy quantifiers either, although removing them doesn't change what the pattern matches.
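For reference, here is a minimal, self-contained sketch of the single-pass replacement using the same pattern. The class and method names (`EmptyTagRemover`, `removeOnce`, `removeAll`) are made up for this illustration. Note that `Matcher.replaceAll(String)` is also part of `java.util.regex`, so the sketch uses it with a precompiled pattern; `String.replaceAll` should behave the same way here. The `removeAll` loop is an optional extra, assuming you also want to drop elements that only become empty after their empty children are removed.

```java
import java.util.regex.Pattern;

class EmptyTagRemover {
    // Same pattern as in the answer, compiled once so it can be reused.
    private static final Pattern EMPTY_TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)[^>]*>\\s*</\\1>");

    // One pass: removes every element whose content is empty or only whitespace.
    static String removeOnce(String html) {
        return EMPTY_TAG.matcher(html).replaceAll("");
    }

    // Repeats the pass until the string stops changing, so a parent that is
    // left empty after its children were removed is removed as well.
    static String removeAll(String html) {
        String previous;
        do {
            previous = html;
            html = removeOnce(html);
        } while (!html.equals(previous));
        return html;
    }

    public static void main(String[] args) {
        String html = "<div>\n  <p></p>\n  <span class=\"x\"> </span>\n</div><b>kept</b>";
        System.out.println(removeOnce(html)); // the outer <div> is still there
        System.out.println(removeAll(html));  // the outer <div> is gone too
    }
}
```

If this removes tags from a hand-typed test string but not from your real input, the input probably differs from what it looks like on screen.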

Tim Pietzcker
  • Thanks for your suggestion Tim! Unfortunately this doesn't work either. It simply doesn't find a single empty tag. None. I don't understand why. I already tried just matcher.find()-ing it but this also returns false. – kernel Jul 24 '12 at 10:46
  • It works for me. The regex does not allow any whitespace before the tag name or after the closing tag name, though. I don't suppose that's the problem? And you did replace the entire code you posted with my single line, right? – Tim Pietzcker Jul 24 '12 at 10:48
  • You mean if I have a tidy'd well indented `` it wouldn't match its empty `
    `s because of the preceding whitespace? Edit: I tried it with `htmlOut = htmlOut.replaceAll("\\s*<([a-zA-Z][a-zA-Z0-9]*)[^>]*>\\s*</\\1>\\s*", "");` and it doesn't work either. Yes, I replaced my bunch of code with yours ;)
    – kernel Jul 24 '12 at 10:51
  • No, I meant tags like `< a ></ a >` etc., but that doesn't seem to be it. – Tim Pietzcker Jul 24 '12 at 10:58
  • Ah alright ;) No. As I said: The same expression works fine in Sublime Text - with the same input HTML. The problem must be in Java itself (do any regexp flags need to be considered?) – kernel Jul 24 '12 at 11:02
  • No special flags in use here. What encoding is the string you're matching? Not UTF-16 by any chance? – Tim Pietzcker Jul 24 '12 at 11:05
  • Input and Output Encoding of html tidy is set to UTF-8. I'm touching the html right after tidying it. – kernel Jul 24 '12 at 11:11
  • Alright, I isolated the problem based on your encoding advice. It seems there must be some kind of hidden character in between. I'll figure that out! Thank you very much! – kernel Jul 24 '12 at 11:19
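
One way to hunt for such a hidden character is to list every code point in the string that is not plain printable ASCII or an ordinary tab or line break. By default, Java's `\s` covers only the ASCII whitespace characters, so anything reported by a helper like this between an opening and a closing tag would stop the pattern from matching. The sketch below is only an illustration: `HiddenCharFinder` and `describeSuspects` are made-up names, and the no-break space in the example is a guess, since the thread never says which character was actually found.

```java
class HiddenCharFinder {
    // Returns one line per code point that is neither printable ASCII
    // (U+0020 to U+007E) nor a tab, carriage return, or line feed.
    static String describeSuspects(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean plain = (cp >= 0x20 && cp <= 0x7E)
                    || cp == '\n' || cp == '\r' || cp == '\t';
            if (!plain) {
                out.append(String.format("index %d: U+%04X\n", i, cp));
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input: a no-break space hidden inside an "empty" tag.
        String html = "<p>\u00A0</p>";
        System.out.print(describeSuspects(html));
    }
}
```

Running it over `htmlOut` right after the tidy step should show whether anything unexpected sits inside the tags that look empty.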