1

I am trying to replace two or more consecutive occurrences of the <br/> tag (like <br/><br/><br/>) with exactly two (<br/><br/>), using the following pattern:

Pattern brTagPattern = Pattern.compile("(<\\s*br\\s*/\\s*>\\s*){2,}", 
     Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

But in some cases the tags are separated by a space ('<br/> <br/>'), and they get replaced with four <br/> tags when they were supposed to be replaced with just two.

What can I do to ignore the two or three (a few) spaces that come between the tags?

Konrad Rudolph
Arun Abraham
  • 4
    This regex (even though it's being used to parse HTML) should work as is. There seems to be a different problem. Can you provide more context? – Tim Pietzcker Oct 06 '10 at 13:06
  • 2
    Probably not the answer you want to hear, but it is general wisdom that you should *not* attempt to parse XML/HTML with regular expressions. So many things can go wrong -- it's a much better idea to use a parsing library specifically meant for such data, which will also completely bypass the issue you're having. – Adrian Petrescu Oct 06 '10 at 13:07
  • @Adrian: could you give me an example? – Arun Abraham Oct 06 '10 at 13:48
  • 1
    @Arun: Sure :) Take a look at JAXB (http://www.oracle.com/technetwork/articles/javase/index-140168.html) if you are certain your HTML is well-formed XML; if the HTML is likely to be messy and non-compliant (like most real-world HTML), you should try something like TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) – Adrian Petrescu Oct 06 '10 at 13:50
  • I've converted my comments into an answer, since they've kind of turned into one :) – Adrian Petrescu Oct 06 '10 at 13:51
  • @Tim Pietzcker: thanks buddy...guess it was some other prob...i will have to figure it out...the regex is working fine... – Arun Abraham Oct 06 '10 at 15:09

3 Answers

1

Here's some Groovy code to test your Pattern:

import java.util.regex.*

Pattern brTagPattern = Pattern.compile( "(<\\s*br\\s*/\\s*>\\s*){2,}", Pattern.CASE_INSENSITIVE | Pattern.DOTALL )
def testData = [
  ['',                            ''],
  ['<br/>',                       '<br/>'],
  ['< br/> <br />',               '<br/><br/>'],
  ['<br/> <br/><br/>',            '<br/><br/>'],
  ['<br/>   < br/ > <br/>',       '<br/><br/>'],
  ['<br/> <br/>   <br/>',         '<br/><br/>'],
  ['<br/><br/><br/> <br/><br/>',  '<br/><br/>'],
  ['<br/><br/><br/><b>w</b><br/>','<br/><br/><b>w</b><br/>'],
 ]

testData.each { inputStr, expected ->
  Matcher matcher = brTagPattern.matcher( inputStr )
  assert expected == matcher.replaceAll( '<br/><br/>' )
}

And everything seems to pass fine...
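For readers who prefer plain Java, the same check can be sketched as below. The class name `BrPatternCheck` and the helper `collapse` are made up for illustration; the pattern is the one from the question, unchanged. Note that the trailing `\\s*` inside the group also consumes any whitespace that follows the last tag.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class BrPatternCheck {
    // The pattern from the question, copied as-is.
    static final Pattern BR_TAGS = Pattern.compile(
            "(<\\s*br\\s*/\\s*>\\s*){2,}",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Replaces every run of two or more <br/> tags with exactly two.
    static String collapse(String html) {
        return BR_TAGS.matcher(html).replaceAll("<br/><br/>");
    }

    public static void main(String[] args) {
        System.out.println(collapse("<br/> <br/>"));           // <br/><br/>
        System.out.println(collapse("<br/>   < BR/ > <br/>")); // <br/><br/>
        System.out.println(collapse("<br/><br/><br/> hello")); // <br/><br/>hello
    }
}
```

The last call shows the side effect of the trailing `\\s*`: the space before `hello` is removed along with the extra tag.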

tim_yates
  • thanks buddy...it was just an issue raised to me by one of my colleagues..I thought that it was a valid issue...guess something else was causing the issue... – Arun Abraham Oct 06 '10 at 15:04
  • Your code won't work with `<br/><br/><br/> hello`: it will return `<br/><br/>hello` instead of `<br/><br/> hello`. The question asks to ignore *only* the spaces between the tags. – greuze Oct 08 '10 at 08:35
1

Probably not the answer you want to hear, but it is general wisdom that you should not attempt to parse XML/HTML with regular expressions. So many things can go wrong -- it's a much better idea to use a parsing library specifically meant for such data, which will also completely bypass the issue you're having.

Take a look at JAXB if you are certain your HTML is well-formed XML; if the HTML is likely to be messy and non-compliant (like most real-world HTML), you should try something like TagSoup.
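To give a rough idea of what the parser-based approach looks like, here is a minimal sketch. It uses neither JAXB nor TagSoup; as a stand-in it relies on the JDK's built-in DOM parser, so it only applies when the markup is well-formed XML. The class `BrRunCollapser`, its methods, and the wrapping `<root>` element are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes the input is a well-formed XHTML fragment.
class BrRunCollapser {

    // Parses the fragment inside an artificial <root> element.
    static Document parseFragment(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<root>" + xhtml + "</root>")));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Keeps at most maxRun consecutive <br/> siblings and drops the
    // whitespace-only text between them; recurses into other elements.
    static void collapse(Node parent, int maxRun) {
        int run = 0;
        List<Node> pendingWhitespace = new ArrayList<>();
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before any removal
            if (isBr(child)) {
                for (Node ws : pendingWhitespace) {
                    parent.removeChild(ws);     // whitespace sat between two tags
                }
                pendingWhitespace.clear();
                run++;
                if (run > maxRun) {
                    parent.removeChild(child);
                }
            } else if (run > 0 && isWhitespaceText(child)) {
                pendingWhitespace.add(child);   // only removed if another <br/> follows
            } else {
                run = 0;
                pendingWhitespace.clear();
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    collapse(child, maxRun);
                }
            }
            child = next;
        }
    }

    private static boolean isBr(Node node) {
        return node.getNodeType() == Node.ELEMENT_NODE
                && "br".equalsIgnoreCase(node.getNodeName());
    }

    private static boolean isWhitespaceText(Node node) {
        return node.getNodeType() == Node.TEXT_NODE
                && node.getTextContent().trim().isEmpty();
    }
}
```

Calling `BrRunCollapser.collapse(doc.getDocumentElement(), 2)` on the parsed document modifies it in place; the result could then be written back out with `javax.xml.transform.Transformer`. For real-world HTML that is not valid XML, this parser will reject the input, which is where a lenient parser such as TagSoup would come in.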

Adrian Petrescu
  • +1, and requisite link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Kirk Woll Oct 06 '10 at 14:02
0

You can do that by changing your regex a little:

Pattern brTagPattern = Pattern.compile("<\\s*br\\s*/\\s*>\\s*<\\s*br\\s*/\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

This will ignore any whitespace between two <br/> tags. If you want to allow exactly two or three whitespace characters between them, you can use:

Pattern brTagPattern = Pattern.compile("<\\s*br\\s*/\\s*>(\\s){2,3}<\\s*br\\s*/\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
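As a possible variation on the same idea, the sketch below matches a whole run of two or more tags while consuming whitespace only between tags, never after the last one. This is not one of the patterns above; the class name `BrBetweenOnly` and the helper `collapse` are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating a variant pattern.
class BrBetweenOnly {
    // One tag, followed by one or more "optional whitespace + tag" groups.
    // Whitespace is only matched when another <br/> follows it.
    static final Pattern BR_RUN = Pattern.compile(
            "<\\s*br\\s*/\\s*>(?:\\s*<\\s*br\\s*/\\s*>)+",
            Pattern.CASE_INSENSITIVE);

    // Replaces every run of two or more tags with exactly two.
    static String collapse(String html) {
        return BR_RUN.matcher(html).replaceAll("<br/><br/>");
    }

    public static void main(String[] args) {
        System.out.println(collapse("<br/> <br/>"));           // <br/><br/>
        System.out.println(collapse("<br/><br/><br/> hello")); // <br/><br/> hello
        System.out.println(collapse("<br/> hello"));           // <br/> hello
    }
}
```

Because the whitespace sits at the start of the repeated group, the regex engine gives it back whenever no further tag follows, so text after the run keeps its leading space.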
greuze