
I'm trying to use regex to clean up some code generated in my own HTML5 RTE. Searching around, I see a lot of people saying regex should not be used to parse HTML... but I'm doing this client-side with JavaScript. Do I have any other option than regex?

I have been trying to use lookbehinds (just found out about them) but they don't seem to work with JavaScript. What I want to do is delete all <br> at the very end of <p>'s, but not those that are the only element in the paragraph, like <p><br></p>. So:

<p>Blah<br><br><br></p> becomes <p>Blah</p>
<p><br></p> stays the same.

So far I only have

html = html.replace(/(?:<br\s?\/?>)+(<\/p>)/g, '$1');

This deletes every <br> at the end of a paragraph, no matter how many, including the lone one in <p><br></p>.

I would like something like

html = html.replace(/(?<!<p>)(?:<br\s?\/?>)+(<\/p>)/g, '$1');
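Lookbehind assertions were added to JavaScript in ES2018, so in an engine that supports them the pattern above runs as written. The following is only a minimal sketch under that assumption; the function name `stripTrailingBreaks` is made up for illustration, and older engines reject the pattern with a SyntaxError.

```javascript
// Assumes an ES2018+ engine with lookbehind support.
// `stripTrailingBreaks` is a hypothetical helper name, not part of any library.
function stripTrailingBreaks(html) {
  // Remove a run of <br> before </p>, unless the run starts right after <p>.
  return html.replace(/(?<!<p>)(?:<br\s?\/?>)+(<\/p>)/g, '$1');
}

console.log(stripTrailingBreaks('<p>Blah<br><br><br></p>')); // <p>Blah</p>
console.log(stripTrailingBreaks('<p><br></p>'));             // <p><br></p>
```

Note that <p><br><br></p> collapses to <p><br></p> with this pattern, because only the first <br> is directly preceded by <p>.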

EDIT: I'm using a contenteditable div to create a very simple RTE that is dynamically created every time a user wants to change some text. Basically it just clears redundant span, br, and p tags, and such.

iOfWhy
  • Yes you have other options! Use the DOM, use jQuery, use [htmlparser.js](http://ejohn.org/blog/pure-javascript-html-parser/) if you must! Don't even mention regex, or people will post tangential links about `Tony the Pony`. – RichardTowers Jan 13 '13 at 22:39
  • @RichardTowers So far using regex has been really easy and fast... what are the reasons for not using it? Looking around, most people just seem religious about the topic without giving any real reasons... – iOfWhy Jan 13 '13 at 22:56
  • @iOfWhy, Please refrain from parsing HTML with RegEx as it will [drive you į̷̷͚̤̤̖̱̦͍͗̒̈̅̄̎n̨͖͓̹͍͎͔͈̝̲͐ͪ͛̃̄͛ṣ̷̵̞̦ͤ̅̉̋ͪ͑͛ͥ͜a̷̘͖̮͔͎͛̇̏̒͆̆͘n͇͔̤̼͙̩͖̭ͤ͋̉͌͟eͥ͒͆ͧͨ̽͞҉̹͍̳̻͢](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Alexander Jan 13 '13 at 22:57
  • @Alexander could someone please tell me why! It's so frickin easy, nice, fast and flexible... Why!? I MUST KNOW WHY!?!?! – iOfWhy Jan 13 '13 at 23:01
  • Do not do this job clientside! Why would you want to serve wrong markup? Instead, fix your "code" generator! – Bergi Jan 13 '13 at 23:12
  • @Bergi I am making my own simple little rich text editor using contenteditable. All divs with class edit are clickable; once clicked they are made contenteditable=true and buttons are created. Once saved, the editor is "closed", but the markup has to be fixed before posting it back on the webpage. – iOfWhy Jan 13 '13 at 23:18
  • Just don't allow an "invalid" DOM then. Before saving (or always), loop through your paragraphs and remove whitespace(-only) nodes from the end, including `<br>` elements. – Bergi Jan 14 '13 at 00:48
  • Good explanation [here](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). In your case you have a really limited subset of html to match, so regex *might* be good enough (and faster, easier to develop, whatever). *However* as the job becomes more complex, you'll quickly find it becomes a nightmare. – RichardTowers Jan 14 '13 at 07:53

5 Answers

3

Using a DOM parser.

The idea is to collect all the consecutive <br> elements in an array, wiping the array each time a non-empty text node or any other element appears.

If there is a list of <br> by the end of the loop, then remove them. Those are the trailing <br> elements.

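var paragraphs = document.getElementsByTagName("p");
for (var i = 0; i < paragraphs.length; i++) {
  var p = paragraphs[i];
  var trailing = []; // current run of <br> elements and whitespace-only text nodes
  var alone = true;  // stays true while the paragraph holds nothing but <br>/whitespace
  for (var j = 0; j < p.childNodes.length; j++) {
    var child = p.childNodes[j];
    var isBr = child.nodeName === "BR";
    var isBlankText = child.nodeType === 3 && child.textContent.trim() === "";
    if (isBr || isBlankText) {
      trailing.push(child);
    } else {
      // Real content (text or any other element): the run so far is not trailing.
      alone = false;
      trailing = [];
    }
  }
  // If the paragraph has no real content, keep its last node (e.g. <p><br></p>).
  for (var k = 0; k < trailing.length - (alone ? 1 : 0); k++) {
    p.removeChild(trailing[k]);
  }
}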

For example,

<p>Foo<br><br><br></p>
<p>Foo<br>Bar<br><br></p>
<p><br></p>

becomes

<p>Foo</p>
<p>Foo<br>Bar</p>
<p><br></p>

See it here.

Disclaimer: I didn't clean it up. I will come back to it later.

Alexander
2

You're right, you can't use regular expressions to parse HTML because they are incapable of doing so.

Yes, you have other options. There are several forgiving HTML parsing JS libraries that were originally targeted at Node but should work in the browser.

You can also just take advantage of the fact that the browser has a built-in HTML parser, and use that to parse your HTML. A DocumentFragment may be of use in this situation. Or, in your case, you can simply modify the DOM in the contenteditable element.
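As a rough illustration of the built-in-parser route, the sketch below parses the markup into a detached DocumentFragment, trims each paragraph, and serializes the result again. All helper names are hypothetical, and using a `<template>` element to obtain the fragment is an assumption about the target browsers, not something this answer prescribes.

```javascript
// Hypothetical helpers for illustration; none of these names come from a library.

// True for a <br> element or a whitespace-only text node.
function isBreakOrBlank(node) {
  if (node.nodeType === 1) return node.nodeName === 'BR';
  return node.nodeType === 3 && node.textContent.trim() === '';
}

// Drop trailing <br>/blank nodes from one paragraph, unless the paragraph
// holds nothing else (so <p><br></p> is kept as it is).
function trimTrailingBreaks(p) {
  var hasContent = Array.prototype.some.call(p.childNodes, function (node) {
    return !isBreakOrBlank(node);
  });
  if (!hasContent) return;
  while (p.lastChild && isBreakOrBlank(p.lastChild)) {
    p.removeChild(p.lastChild);
  }
}

// Browser-only: let the built-in parser build the nodes, clean them, serialize again.
function cleanHtml(html) {
  var template = document.createElement('template');
  template.innerHTML = html; // parsed into template.content, a DocumentFragment
  var paragraphs = template.content.querySelectorAll('p');
  Array.prototype.forEach.call(paragraphs, trimTrailingBreaks);
  return template.innerHTML; // serialized back to a string
}
```

In a browser, `cleanHtml('<p>Blah<br><br><br></p><p><br></p>')` should give `<p>Blah</p><p><br></p>`. When the markup already lives in the contenteditable element, `trimTrailingBreaks` can be called on its paragraphs directly and the string round trip is not needed.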

josh3736
  • @josh3736 I don't really understand why I should not use regex. So far everything I have done has been really fast, easy to code and seemingly efficient. I can easily remove tags I don't want, easily replace unwanted tags with other tags... insert, remove, etc. It seems much slower to change everything using the DOM, because then I have to append every child to the new node... – iOfWhy Jan 13 '13 at 22:54
  • @iOfWhy: Because--as I mentioned--regular expressions are literally *incapable* of correctly parsing HTML. As you've found, you can use regexps to approximate your desired result *with known input*, but I guarantee I could write valid HTML that would break your regexp. (What happens if there's text that looks like HTML in a ``? `CDATA`? Self-closing tags?) [This question's answers have more details.](http://stackoverflow.com/q/590747/201952) – josh3736 Jan 13 '13 at 23:16
  • I think I get it... but I guess I'm not parsing HTML then, just cleaning up code generated in a contenteditable="true" area. – iOfWhy Jan 13 '13 at 23:29
  • @Bergi: It wasn't immediately clear where the HTML was coming from when the question was first asked. – josh3736 Jan 14 '13 at 00:52
  • @josh3736: I see. Still, I think the more relevant part of an answer would be how to remove the tags from the DOM according to the given rules. – Bergi Jan 14 '13 at 00:54
0

This seems overly complex. Did you try something simpler like:

<p>.+(<br>)+<\/p>

This should match any <br> that is enclosed within a paragraph, at the very end of it (right before the closing tag) and has something between itself and the opening tag. You should probably change it so it doesn't accept spaces as something valid, but you get the idea.
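A quick way to see what this pattern actually captures is to run it against the sample input from the question. The sketch below (variable names are illustrative only) adds a capture group around `.+` so the greedy part can be inspected.

```javascript
var sample = '<p>Blah<br><br><br></p>';

// Same pattern as above, with the `.+` part captured so it can be inspected.
var greedy = /<p>(.+)(<br>)+<\/p>/.exec(sample);
console.log(greedy[1]); // Blah<br><br>

// A lazy quantifier hands the whole <br> run over to the second group.
var lazy = /<p>(.+?)(<br>)+<\/p>/.exec(sample);
console.log(lazy[1]); // Blah
```

With the greedy `.+`, only the last <br> is left for the `(<br>)+` group, so a replacement built on this pattern would remove a single <br> per paragraph.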

npepinpe
  • It looks complex because I wanted to delete `<br />` as well, but that's never generated so I'm going to screw that. Since + is greedy, yours isn't going to work... even with .+? instead it still didn't work as expected. For now I mimicked lookbehind using [link](http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript) – iOfWhy Jan 13 '13 at 23:27
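One common way to mimic a negative lookbehind in engines without lookbehind support is an optional capture group plus a replacement callback. The sketch below illustrates that idea for this question; it is not necessarily the code from the linked post, the name `removeTrailingBreaks` is hypothetical, and it assumes bare `<p>` tags without attributes.

```javascript
// Hypothetical helper; mimics (?<!<p>) with an optional capture group.
function removeTrailingBreaks(html) {
  return html.replace(/(<p>)?(?:<br\s?\/?>)+(<\/p>)/g, function (match, open, close) {
    // If the run of <br> started right after <p>, `open` is set: leave the match alone.
    return open ? match : close;
  });
}

console.log(removeTrailingBreaks('<p>Blah<br><br><br></p>')); // <p>Blah</p>
console.log(removeTrailingBreaks('<p><br></p>'));             // <p><br></p>
console.log(removeTrailingBreaks('<p>Foo<br>Bar<br /></p>')); // <p>Foo<br>Bar</p>
```

Unlike a real lookbehind, the optional group consumes the `<p>`, so a paragraph containing only <br> elements (for example <p><br><br></p>) is returned unchanged.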
0

Here it is in a few lines of jQuery:

// Note: in order to load the html into the dom it needs a root. I'm using `div`:
var input = '<div>' +
  '<p>Blah<br><br><br></p> becomes <p>Blah</p>' +
  '<p><br></p> stays the same.' +
  '</div>';

// Load the html into a jQuery object:
var $html = $(input);
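// Get all the `<br>`s at the end of `p`s that are not the only-child.
// Caveat: `:only-child` looks at element siblings only and ignores text nodes,
// so a single trailing `<br>` after plain text (`<p>Blah<br></p>`) is left in place.
var $lastBreaks = $html.find('p>:last-child:not(:only-child)').filter('br');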
// Remove any immediately preceding `br`s:
$lastBreaks.prevUntil(':not(br)').remove();
// Remove the last `br`s themselves
$lastBreaks.remove();

// Output:
console.log($html.html());

Outputs:

<p>Blah</p> becomes <p>Blah</p><p><br></p> stays the same.

http://jsfiddle.net/nnH4G/

The reasons that this method is better than using a regex:

  1. It's much more obvious what you're doing. When you or another developer come back to this later you won't have to think "what on earth does the regex %&^@!£%*cthulu&GJHS^&@ do?"

  2. It's easier to extend / modify. If your requirements were even slightly more complex it would become literally impossible to achieve this with (JavaScript's) regexes because of regex's and HTML's relative positions in the Chomsky hierarchy.

  3. People who see your code will think you're generally a pretty cool guy.

jQuery is by no means the only way of doing this, as other answers have pointed out. But given how ubiquitous it is on the client side it's a pretty useful tool.

RichardTowers
0

Regex solution (not that I am suggesting you should use this over DOM parsing):

I am not clear from your question what you want to happen with, for example,
'<p><br><br></p>', so there are two solutions below.

If you want it left as it is, you can use 1); if you want it to become '<p></p>', you can use 2):

1)

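html = html.replace(
    // (?!<\/p>) keeps the lazy group from running past the end of its own paragraph;
    // without it, '<p>Blah</p><p><br></p>' would lose its lone <br>.
    /<p>((?:(?!<\/p>)[\s\S])+?)(?:<br>)+<\/p>/g,
    function ( $0, $1 ) { return $1 === '<br>' ? $0 : '<p>' + $1 + '</p>'; }
);

Test

function test(html) {
    return html.replace(
        /<p>((?:(?!<\/p>)[\s\S])+?)(?:<br>)+<\/p>/g,
        function ( $0, $1 ) { return $1 === '<br>' ? $0 : '<p>' + $1 + '</p>'; }
    );
}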

test( '<p>Blah</p>' );                // <p>Blah</p>
test( '<p>Blah<br><br><br></p>' );    // <p>Blah</p>   
test( '<p><br>Blah<br></p>' );        // <p><br>Blah</p>
test( '<p><br></p>' );                // <p><br></p>
test( '<p><br><br></p>' );            // <p><br><br></p>   

2)

html = html.replace( /(?:([^>]|[^pb]>)(?:<br>)+|(?:<br>){2,})<\/p>/g, '$1</p>' );

Test

function test(html) {
    return html.replace( /(?:([^>]|[^pb]>)(?:<br>)+|(?:<br>){2,})<\/p>/g, '$1</p>' );
}

test( '<p>Blah</p>' );                // <p>Blah</p>
test( '<p>Blah<br><br><br></p>' );    // <p>Blah</p>   
test( '<p><br>Blah<br></p>' );        // <p><br>Blah</p>
test( '<p><br></p>' );                // <p><br></p>
test( '<p><br><br></p>' );            // <p></p>  
MikeM