5

I'm trying to replace any <br /> tags that appear AFTER a </h2> tag. This is what I have so far:

Text = Text.replace(new RegExp("</h2>(\<br \/\>.+)(.+?)", "g"), '</h2>$2');

It doesn't seem to work, can anyone help? (No matches are being found).

Test case:

<h2>Testing</h2><br /><br /><br />Text

To:

<h2>Testing</h2>Text
Tom Gullen
  • 4
    Oh lord, please just use a parser. – Matt Ball Apr 28 '11 at 22:37
  • 1
    It's like you're begging me to post a link to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Gabe Moothart Apr 28 '11 at 22:41
  • @Gabe, I don't see how, this is for a WYSIWYG editor I'm writing, it turns `\n` into `<br />` and `##title##` into `<h2>Title</h2>` but now I just want to remove all trailing `<br />` after the `h2` or it looks bad. – Tom Gullen Apr 28 '11 at 22:43
  • 1
    Use a parser library if available. You would even be better off just writing a quick and simple character-by-character parser. It would actually be less work, more satisfying, easier to understand and less error-prone than regex. And you can add more features easily when you need to. My rule of thumb is regular _expression_: it's only one or two levels up from tokens. You could use regex to validate a single HTML element or a text node. I would consider that expression-level. But not structured HTML. No doubt someone will come up with a very clever regex which solves your problem. – rohannes Apr 28 '11 at 22:51
  • @Rohannes, I think regexp is better, because once the form is submitted the data has to be processed server side to produce the same output, so maintaining regexp is easier this way. – Tom Gullen Apr 28 '11 at 23:32

4 Answers

16

This is simpler than you're making it out to be:

Text = Text.replace(new RegExp("</h2>(\<br \/\>)*", "g"), "</h2>");
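To check this against the test case from the question, here is a minimal sketch. The function name `stripBrAfterH2` is made up for illustration. Note that inside a JavaScript string literal `\<`, `\/` and `\>` are just `<`, `/` and `>`, so the sketch drops the backslashes; the pattern is otherwise the same.

```javascript
// Hypothetical helper name; the pattern is the one from the answer above,
// written without the redundant backslashes.
function stripBrAfterH2(html) {
  // Match "</h2>" followed by zero or more literal "<br />" tags,
  // and replace the whole run with "</h2>" alone.
  return html.replace(new RegExp("</h2>(<br />)*", "g"), "</h2>");
}

const testCase = "<h2>Testing</h2><br /><br /><br />Text";
console.log(stripBrAfterH2(testCase)); // <h2>Testing</h2>Text
```

This only matches the exact spelling `<br />` (one space before the slash), so it assumes the editor always emits that form.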
mVChr
5

This would do what you are asking:

Text = Text.replace(new RegExp("</h2>(<br />)*", "g"), '</h2>');
serby
5

If you have jQuery kicking around then you can do this safely without regular expressions:

var $dirty = $('<div>').append('<p>Where is<br>pancakes</p><h2>house?</h2><br><br>');
$dirty.find('h2 ~ br').remove();
var clean = $dirty.html();
// clean is now "<p>Where is<br>pancakes</p><h2>house?</h2>"

This will also insulate against the differences between <br>, <br/>, <br />, <BR>, etc.

mu is too short
  • Thanks, I think going regexp is better because I have to duplicate all these rules serverside in c# when the form is actually submitted. – Tom Gullen Apr 28 '11 at 23:31
  • @Tom: I'd recommend that you use an HTML parser (with both element and attribute whitelisting) on the server side too, you should fully scrub everything that comes from the client even if you're doing client side scrubbing and even if you fully trust your users. OTOH, this is your project, not mine :) – mu is too short Apr 29 '11 at 04:49
3

You can also make this a little nicer using the regex literal syntax:

Text = Text.replace(/<\/h2>(<br\s*\/>)*/g, '</h2>');
serby
  • 2
    I'd change the `*` to a `+`. Otherwise, you are unnecessarily replacing `</h2>` with `</h2>` when there are zero `<br />` tags. – ridgerunner Apr 28 '11 at 23:23
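
For anyone staying with regex, the `+` suggestion can be combined with some tolerance for the `<br>` spellings mentioned in the jQuery answer. This is only a sketch, not a general HTML solution: the function name `stripBrAfterH2Loose` is invented here, and allowing whitespace between the tags is an extra assumption that was not part of the original question.

```javascript
// Hypothetical helper; a looser variant of the regex literal in the answer above.
function stripBrAfterH2Loose(html) {
  // (?:...)+   requires at least one <br>, so a bare </h2> is never rewritten
  // \s*        tolerates whitespace or newlines before each <br>
  // <br\s*\/?> accepts <br>, <br/> and <br />
  // i flag     also accepts <BR>; g flag handles every heading in the string
  return html.replace(/<\/h2>(?:\s*<br\s*\/?>)+/gi, "</h2>");
}

console.log(stripBrAfterH2Loose("<h2>Testing</h2><br><BR/>\n<br />Text"));
// <h2>Testing</h2>Text
```

It still will not match a `<br>` that carries attributes (for example `<br class="x">`), and because of the `i` flag an uppercase `</H2>` followed by breaks gets rewritten as lowercase `</h2>`.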