Let's say you have an HTML string like this:

<div id="loco" class="hey" >lorem ipsum pendus <em>hey</em>moder <hr /></div>

And you need to place <br/> elements after every space character, which I was doing with:

HTMLtext.replace(/\s{1,}/g, ' <br/>');

However, this also inserts breaks after the space characters inside tags (between tag attributes), and I'd like to do this for the text content of tags only. I've always been bad with regular expressions. Could anyone help out?

So basically: do my original whitespace match, but only if it's not between < and >?
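
To make the problem concrete, here is a small sketch of what the original replacement does to the sample string. `HTMLtext` is the name used in the snippet above; `broken` is just an illustrative variable name.

```javascript
// The sample string from the question.
var HTMLtext = '<div id="loco" class="hey" >lorem ipsum pendus <em>hey</em>moder <hr /></div>';

// The original replacement: every run of whitespace becomes ' <br/>',
// including the whitespace between attributes inside the tags.
var broken = HTMLtext.replace(/\s{1,}/g, ' <br/>');

console.log(broken);
```

The logged string starts with `<div <br/>id="loco" <br/>class="hey" <br/>>lorem`, which shows the breaks landing inside the opening tag as well as in the text.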

Michael
  • In the general case, parsing HTML with regular expressions is not possible. It can only be done when you know that the HTML source is constrained in particular ways. If it can really be any arbitrary fragment of HTML, then you can't do it with a regular expression. Give the HTML to the browser, let it build a DOM fragment, and then look for the text nodes and modify those. – Pointy Oct 04 '12 at 15:21
  • Thanks, yeah, I'm aware of those issues, but I'm not really parsing HTML. I'm just trying to get the whitespace characters, and I'm doing it in a limited environment with HTML that I control. I could do it with the DOM (and I did originally), but I'm trying to avoid that since DOM operations are costly and I'm trying to optimise the code a bit. – Michael Oct 04 '12 at 15:30
  • Well the thing is that in order to identify which whitespace characters you have to replace and which you don't, you have to come pretty close to parsing the HTML. – Pointy Oct 04 '12 at 15:32
  • I just need to be sure it's not within < and >; I don't care about anything else. I don't think that's such a complex case. – Michael Oct 04 '12 at 15:33
  • Yes that should be OK if you're sure that there won't be angle brackets inside attribute values, no CDATA sections, etc. – Pointy Oct 04 '12 at 15:34
  • @Pointy - You don't need to parse the HTML string. A simple lexer would do the job. See my answer. – Aadit M Shah Oct 04 '12 at 15:56

3 Answers

Regex is not a good tool for this. You should be working with the DOM, not with the raw HTML string.

For a quick-and-dirty solution that presupposes that there are no < or > characters in your string except those delimiting a tag, you can try this, though:

result = subject.replace(/\s+(?=[^<>]*<)/g, "$&<br/>");

This inserts a <br/> after whitespace only if the next angle bracket is an opening angle bracket.

Explanation:

\s+     # Match one or more whitespace characters (including newlines!)
(?=     # but only if (positive lookahead assertion) it's possible to match...
 [^<>]* #  any number of non-angle brackets
 <      #  followed by an opening angle bracket
)       # ...from this position in the string onwards.

Replace that with $& (which contains the matched characters) plus <br/>.

This regex does not check whether there is a > earlier in the string, as this would require a positive look*behind* assertion, which JavaScript did not support when this was written (lookbehind assertions were added in ES2018). If you control the HTML and are sure that the conditions I mentioned above are met, that shouldn't be a problem.
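
As a rough sketch of how the regex above behaves on the sample string from the question, the snippet below wraps it in small helper functions. The function names are made up for illustration. `addBreaksToEnd` uses the `(?:<|$)` lookahead suggested in the comments, and `addBreaksLookbehind` is an alternative that assumes an engine with ES2018 lookbehind support; all three still assume that < and > appear only as tag delimiters.

```javascript
// Hypothetical helper names; only the regexes matter.

// Whitespace followed (after any non-bracket characters) by "<".
function addBreaks(html) {
    return html.replace(/\s+(?=[^<>]*<)/g, "$&<br/>");
}

// Same, but also matches whitespace after the last tag (up to end of string).
function addBreaksToEnd(html) {
    return html.replace(/\s+(?=[^<>]*(?:<|$))/g, "$&<br/>");
}

// Alternative for ES2018+ engines: whitespace NOT preceded by an unclosed "<".
// Built with new RegExp so that older engines only fail when this is called.
function addBreaksLookbehind(html) {
    var re = new RegExp("(?<!<[^<>]*)\\s+", "g");
    return html.replace(re, "$&<br/>");
}

var sample = '<div id="loco" class="hey" >lorem ipsum pendus <em>hey</em>moder <hr /></div>';

console.log(addBreaks(sample));
console.log(addBreaksToEnd("<p>x</p> tail end"));
console.log(addBreaksLookbehind(sample));
```

For the sample string, the first and third calls leave the whitespace inside `<div ...>` and `<hr />` untouched and add `<br/>` after each space in the text. The first version does nothing to text that has no tag after it, which is the case the `(?:<|$)` variant covers.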

Tim Pietzcker
  • I guess the lookahead should be `[^<>]*(<|$)` – Bergi Oct 04 '12 at 15:28
  • @Bergi: I guess we don't want to insert `<br/>` after the last tag, do we? I'm not sure... – Tim Pietzcker Oct 04 '12 at 15:29
  • Thanks @TimPietzcker - that seems to work great! I had that working in the DOM already, but this is actually an attempt to optimise my JS code. That function can process a lot of text, so tinkering with the DOM directly is not the best solution for that, I think. Can you walk me through that a bit? I'm assuming it's matching whitespace as long as it's not between < and >, right? – Michael Oct 04 '12 at 15:37
  • @Tim: Maybe not after the last word (assuming there is whitespace), but still any whitespace should be replaced if there is no tag ahead of it. – Bergi Oct 04 '12 at 15:42
  • @Michael: I've edited my answer. Hope this makes it clearer. And be sure to take Bergi's advice into consideration - if you want to match whitespace *after* the last tag in a string, you need to use `/\s+(?=[^<>]*(?:<|$))/g` instead. – Tim Pietzcker Oct 04 '12 at 15:45

See this answer for iterating the DOM and replacing whitespace with <br /> elements. The adapted code would be:

(function iterate_node(node) {
    if (node.nodeType === 3) { // Node.TEXT_NODE
        var text = node.data,
            words = text.split(/\s/);
        if (words.length > 1) {
            node.data = words[0];
            var next = node.nextSibling,
                parent = node.parentNode;
            for (var i=1; i<words.length; i++) {
                var tnode = document.createTextNode(words[i]),
                    br = document.createElement("br");
                parent.insertBefore(br, next);
                parent.insertBefore(tnode, next);
            }
        }
    } else if (node.nodeType === 1) { // Node.ELEMENT_NODE
        for (var i=node.childNodes.length-1; i>=0; i--) {
            iterate_node(node.childNodes[i]); // run recursive on DOM
        }
    }
})(content); // any dom node

(Demo at jsfiddle.net)

Bergi
  • Thanks, but as mentioned in other comments, this was an optimisation attempt. I had this in the DOM already, but the theory was that DOM operations would be much more costly than just playing with strings, which also makes sense. I did a quick check here http://jsperf.com/spacereplacer to see if it is faster and yeah - on most browsers it's twice as fast... – Michael Oct 04 '12 at 16:10
  • @Michael: Thanks for actually profiling this. It's very interesting that Safari manages to manipulate the DOM just as fast as the regex (and is fastest overall), so it's not a built-in advantage of regexes. – Tim Pietzcker Oct 05 '12 at 11:10
  • Yeah, I was surprised by that too, but I think it's a bug in Safari 6 (or the benchmarkjs powering jsperf). I just tested with an older Safari 5.1 and it gives the same result as the other browsers. Plus similar WebKit engines from iPhone/iPad/Chrome are all the same... and in the end it also doesn't make sense that tinkering with the DOM would be the same speed as purely mathematical/string operations... – Michael Oct 05 '12 at 16:20

Okay, so you don't want to match spaces inside HTML tags. Regular expressions alone aren't sufficient for this, so I'll use a lexer to do the job. You can see the output here.

var lexer = new Lexer;

var result = "";

lexer.addRule(/</, function (c) { // start of a tag
    this.state = 2; // go to state 2 - exclusive tag state
    result += c; // copy to output
});

lexer.addRule(/>/, function (c) { // end of a tag
    this.state = 0; // go back to state 0 - initial state
    result += c; // copy to output
}, [2]); // only apply this rule when in state 2

lexer.addRule(/.|\n/, function (c) { // match any character
    result += c; // copy to output
}, [2]); // only apply this rule when in state 2

lexer.addRule(/\s+/, function () { // match one or more spaces
    result += "<br/>"; // replace with "<br/>"
});

lexer.addRule(/.|\n/, function (c) { // match any character
    result += c; // copy to output
}); // everything else

lexer.input = '<div id="loco" class="hey" >lorem ipsum pendus <em>hey</em>moder <hr /></div>';

lexer.lex();

Of course, a lexer is a very powerful tool. You could also skip angle brackets inside the value of an attribute in a tag. However, I'll leave that for you to implement. Good luck.
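
For readers who would rather not pull in a library, the same state-machine idea can be sketched as a hand-written scanner. This is only an illustration under stated assumptions, not the lexer's implementation: `breakTextWhitespace` is a hypothetical name, the function keeps the whitespace and appends `<br/>` (as in the question) rather than replacing it, and it additionally tracks quoted attribute values so that a > inside quotes does not end the tag. Comments, CDATA sections, script contents and a bare < in text are not handled.

```javascript
// Hypothetical helper, not part of any library mentioned in this thread.
function breakTextWhitespace(html) {
    var out = "";
    var inTag = false; // true between "<" and the ">" that closes the tag
    var quote = null;  // the quote character while inside an attribute value
    var i = 0;
    while (i < html.length) {
        var c = html[i];
        if (inTag) {
            if (quote) {
                if (c === quote) { quote = null; }  // closing quote
            } else if (c === '"' || c === "'") {
                quote = c;                          // opening quote
            } else if (c === ">") {
                inTag = false;                      // end of the tag
            }
            out += c;                               // copy tag content as-is
            i++;
        } else if (c === "<") {
            inTag = true;                           // start of a tag
            out += c;
            i++;
        } else if (/\s/.test(c)) {
            // Consume the whole run of whitespace, then add one break.
            var start = i;
            while (i < html.length && /\s/.test(html[i])) { i++; }
            out += html.slice(start, i) + "<br/>";
        } else {
            out += c;                               // ordinary text character
            i++;
        }
    }
    return out;
}

console.log(breakTextWhitespace('<a title="x > y">one two</a>'));
```

Whether this is faster than the regex or the DOM approach would need to be measured; it is meant to show the state handling, not to make a performance claim.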

Aadit M Shah
  • Thanks Aadit - that's a very interesting library, but I'd rather not add external code since I'm trying to optimise mine. Plus, from a quick look at your library source, you're still using regular expressions, right? – Michael Oct 04 '12 at 16:14
  • Yes, I am. Why reinvent the wheel? =) – Aadit M Shah Oct 04 '12 at 16:24