<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>

I'd like to create a regexp that safely matches these:

<br>Aggie<br>
<br>John<br>
<br>Mary<br>
<br><b>Peter</b><br>

It is possible that there are other tags (e.g. <i>, <strike>, etc.) between each pair of <br> tags, and they have to be captured just like <br><b>Peter</b><br>.

What should the regexp look like?

bobo
  • http://www.codinghorror.com/blog/archives/001311.html *sigh* – Joey Nov 19 '09 at 15:21
  • I understand it's sometimes better to do this using an HTML parser. But this is actually just a made-up example that I want to see what syntax it would be if it really has to be done in regex. – bobo Nov 19 '09 at 17:09

3 Answers


If you learn one thing on SO, let it be: "Do not parse HTML with a regex". Use an HTML parser.
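As a rough illustration of the parser route, here is a minimal sketch using Python's standard-library `html.parser`. The class name `BrSplitter` and its `segments` attribute are made up for this example, and it assumes that "collecting" means gathering whatever markup sits between consecutive <br> tags.

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Hypothetical helper: collects the markup between consecutive <br> tags."""

    def __init__(self):
        super().__init__()
        self.segments = []   # finished, non-empty segments
        self._current = []   # pieces of the segment being built

    def _flush(self):
        text = "".join(self._current)
        if text:
            self.segments.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self._flush()                      # a <br> ends the current segment
        else:
            self._current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "br":                        # ignore the end half of <br/>
            self._current.append(f"</{tag}>")

    def handle_data(self, data):
        # Note: character references arrive already decoded here.
        self._current.append(data)

    def close(self):
        super().close()                        # lets the parser emit pending text
        self._flush()


html = ("<br>Aggie<br><br>John<br><p>Hello world</p>"
        "<br>Mary<br><br><b>Peter</b><br>")

splitter = BrSplitter()
splitter.feed(html)
splitter.close()
print(splitter.segments)
```

With this reading, the <p>Hello world</p> block also shows up as a segment, because it too sits between two <br> tags. Dropping it would need an extra rule that the question does not spell out.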

RC.
  • To anyone pointing automatically to this one, quoting from the very same blog post: "Many programs will neither need to, nor should, anticipate the entire universe of HTML when parsing." It's absolutely OK to parse HTML-like input if you keep this in mind. – candiru Nov 19 '09 at 15:28
  • This question is missing the obligatory bobince reference: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – intgr Nov 19 '09 at 15:29
  • @candiru: The asker explicitly asked for a regexp that is **"safe"**. Regexps are fine for one-off hacks, but they are certainly not safe. – intgr Nov 19 '09 at 15:30
  • intgr: It's linked from Jeff's post I linked in the comment to the question. It's just another pointer to dereference :-) – Joey Nov 19 '09 at 15:44
<br>.*?<br>

will match anything from one <br> tag to the closest following one.

The main problem with parsing HTML using regexes is that regexes can't handle arbitrarily nested structures. This is not a problem in your example.
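To see what the pattern does on the sample from the question, here is a small Python sketch. The `re.DOTALL` flag is an added assumption so that `.` also spans line breaks; it is not part of the pattern above.

```python
import re

html = ("<br>Aggie<br><br>John<br><p>Hello world</p>"
        "<br>Mary<br><br><b>Peter</b><br>")

# Lazy match from one <br> to the nearest following <br>.
# re.DOTALL is an assumption added here so '.' also matches newlines.
pattern = re.compile(r"<br>.*?<br>", re.DOTALL)

matches = pattern.findall(html)
for m in matches:
    print(m)
# <br>Aggie<br>
# <br>John<br>
# <br>Mary<br>
# <br><b>Peter</b><br>
```

The matches do not overlap, so each closing <br> is consumed and cannot open the next match. That happens to give the four wanted strings here. With a different layout, for example text before the first <br>, the tags would pair up differently and a bare <br><br> could come back as a match.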

Tim Pietzcker

Split the string at (<br>)+. You'll get empty strings at the beginning and the end of the result, so you need to remove them, too.

If you want to preserve the <br>, then this is not possible unless you know that there is one before and after each element in the result.
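A minimal Python sketch of the split approach follows. The group is written as non-capturing, `(?:<br>)+`, because Python's `re.split` would otherwise return the captured separators as well. The re-wrapping step at the end is only valid under the assumption stated above, that every element really was surrounded by <br>.

```python
import re

html = ("<br>Aggie<br><br>John<br><p>Hello world</p>"
        "<br>Mary<br><br><b>Peter</b><br>")

# Split on runs of one or more <br>; non-capturing so separators are dropped.
parts = re.split(r"(?:<br>)+", html)

# The first and last entries are empty strings here; filter them out.
items = [p for p in parts if p]
print(items)

# Assumption: each item was enclosed in <br>...<br>, so the tags can be re-added.
wrapped = [f"<br>{p}<br>" for p in items]
print(wrapped)
```

Unlike the lazy-match pattern, splitting also yields <p>Hello world</p> as an element, so it would need to be filtered out separately if it is not wanted.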

Aaron Digulla
  • You can still prepend and append a `<br>` to each result, though. Not nice, but if the OP *requires* the `<br>` ... – Joey Nov 19 '09 at 16:03