Javascript Regular Expression to Parse HTML and word wrap?

Question

I need to create a bit of Javascript that can search inputted HTML from a text box and ignore all the tags to automatically word wrap at a set number like say 70 and add a <br> tag.

I also need to find all the character entities like &copy; and &nbsp; and count each as one space, not 5 or 4 spaces.

So the code would take:

<b>Hello</b> Here is some code that I would like to wrap. Lets pretend this goes on for over 70 spaces.

Output would be:

<b>Hello</b> Here is some code that I would like to wrap. Lets pretend <br>
this goes on for over 70 spaces.

Is this possible? How would I begin? Is there already a tool for this?

By the way CSS is out of the question to use.

Using regexes to match HTML is in my opinion _never_ a good idea, what about DOM-Traversing? Out of the question? — Matteo B., Dec 28 '11 at 01:00
Does that number represent the number of characters in the HTML source code or the number of characters in the outputted text (the text that would be the result of parsing the HTML)? — Šime Vidas, Dec 28 '11 at 01:02
CSS is out of the question because I work for a very large company where we do email. Unfortunately multiple email platforms don't do well with CSS. — allencoded, Dec 28 '11 at 01:06
See http://stackoverflow.com/a/1732454/615754 Do you want to assume that `&anytext;` is an html entity and count it as one character, or do you want to have your code actually check for all valid html entities? — nnnnnn, Dec 28 '11 at 01:09
It would count the number of characters in the HTML source code. Only needs to ignore tags and count entities as a 1 character. — allencoded, Dec 28 '11 at 01:10
Your requirement as stated is not going to produce attractive output. If your email contains HTML markup, any email client that understands it will likely wrap automatically. But _you_ want to control wrapping: OK, have you specified a [fixed pitch](http://en.wikipedia.org/wiki/Monospaced_font) font? If not, wrapping at the nth character is going to look silly since the right edges won't line up. Meanwhile, email clients that don't understand HTML markup are presumably going to display all the HTML code and end up with widely varying line widths - assuming they don't autowrap too. — nnnnnn, Dec 28 '11 at 01:44
No this isn't the case. I have done this for 3 years. We just always manually break it before 70 characters. I want to do this automatically with a simple application. Don't over think this. All I need to know is if this is possible. The entities is my hardest point. Then breaking at 70. BTW this is not for a standard HTML email. In fact just ignore its purpose I know what I am doing. I just need information about the app's possibilities. — allencoded, Dec 28 '11 at 02:00
_"All I need to know is if this is possible."_ Of course it is possible. I like ranksrejoined's answer a lot. I was just suggesting that based on the information _you_ provided it didn't seem like a very useful thing to do. But if you're convinced you want to, go right ahead... — nnnnnn, Dec 28 '11 at 02:34
Similar discussion: http://stackoverflow.com/q/7434629/583539. Nope, don't rely on *regex* to parse (open-ended) HTML. — moey, Dec 28 '11 at 03:04
@allencoded - I see that you have selected an answer. Did you even try my solution? It does exactly what your question asked (I tested it pretty thoroughly). It is both faster and more accurate than the solution you selected. — ridgerunner, Dec 28 '11 at 19:01
@ridgerunner - what can you do? The world is a random place sometimes. Oh well. — Jake Feasel, Dec 28 '11 at 22:22

score 2 · Accepted Answer · edited May 23 '17 at 12:20

While the combination of the phrases "regular expression" and "parse HTML" usually causes entire universes to crumble, your use case seems simplistic enough that it could work, but the fact that you want to preserve HTML formatting after wrapping makes it much easier to just work on a space-delimited sequence. Here is a very rough approximation of what you'd like to do:

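var input = "<b>Hello</b> Here is some code that I would like to wrap. Let's pretend this goes on for over 70 spaces. Better &yen;&euro;&#177;, let's <em>make</em> it go on for more than 70, and pick &uuml;&thorn; a whole <strong>bu&ntilde;&copy;h</strong> of crazy symbols along the way.";
var words = input.split(' ');

// Visible width of each word: tags count as 0, each entity counts as 1.
// The character classes keep each match to a single tag or entity, so
// "&yen;&euro;" counts as 2 and "<b>Hello</b>" counts as 5.
var lengths = [];
for (var i = 0; i < words.length; i++)
  lengths.push(words[i].replace(/<[^>]+>/g, '').replace(/&[^;\s]+;/g, ' ').length);

var line = [], offset = 0, output = [];
for (var i = 0; i < words.length; i++) {
  // line.length is the number of spaces needed to join the word to the line.
  // An empty line always accepts a word, so an over-long word cannot loop forever.
  if (line.length === 0 || offset + (lengths[i] + line.length - 1) < 70) {
    line.push(words[i]);
    offset += lengths[i];
  }
  else {
    // The line is full: flush it and retry the same word on a fresh line.
    output.push(line.join(' '));
    offset = 0; line = []; i -= 1;
  }
  if (i == words.length - 1)
    output.push(line.join(' '));
}

output = output.join('<br />');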

which results in

Hello Here is some code that I would like to wrap. Let's pretend this
goes on for over 70 spaces. Better ¥€±, let's make it go on for more
than 70, and pick üþ a whole buñ©h of crazy symbols along the way.

Note that the HTML tags (b, em, strong) are preserved, it's just that Markdown doesn't show them.

Basically, the input string is split into words at each space, which is naïve and likely to cause trouble, but it's a start. Then, the length of each word is calculated after anything resembling an HTML tag or entity has been removed. Then it's a simple matter of iterating over each word, keeping a running tally of the column we're on; once we've struck 70, we pop the aggregated words into the output string and reset. Again, it's very rough, but it should suffice for most basic HTML.
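The width calculation described above can be pulled out into a small helper. This is only a minimal sketch: the name `visibleLength` is made up for illustration, and it assumes that every `<...>` run is a tag and every `&...;` run is a single-character entity, without checking that the entity name is valid.

```javascript
// Hypothetical helper: width of one word as it would be displayed.
// Assumption: every <...> run is a tag (width 0) and every &...; run
// is one entity (width 1). Entity names are not validated.
function visibleLength(word) {
  return word
    .replace(/<[^>]+>/g, '')     // drop tags entirely
    .replace(/&[^;\s]+;/g, ' ')  // each entity becomes one placeholder character
    .length;
}

console.log(visibleLength('<b>Hello</b>'));                       // 5
console.log(visibleLength('&yen;&euro;&#177;,'));                 // 4
console.log(visibleLength('<strong>bu&ntilde;&copy;h</strong>')); // 5
```

Because the input is split on spaces first, a tag that itself contains a space (for example one with attributes) would be cut into several "words" before this helper sees it, which is one of the ways the space-splitting approach can cause trouble.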

+1. I like it. (Though I'm a little confused why you couldn't show the html markup in your sample output.) — nnnnnn, Dec 28 '11 at 02:32
Well, I wanted a monospace font to demonstrate that everything is being properly wrapped, and blockquote‒while allowing for bold and italic text‒doesn't do that. The only other option was to mark it as code output, but that prints raw asterisks and underlines. Is there some way to bold and italicize code that I'm not seeing? — ranksrejoined, Dec 28 '11 at 02:39
Fair enough. And no, you can't bold text already formatted as code (at least, I don't know how). — nnnnnn, Dec 28 '11 at 02:48

ridgerunner · Answer 2 · 2011-12-28T02:17:50.130

This solution "walks" the string token by token counting up to the desired line length. The regex captures one of four different tokens:

$1: HTML open/close tag (width = 0)
$2: HTML entity. (width = 1)
$3: Line terminator. (counter is reset)
$4: Any other character. (width = 1)

Note that I've added a line terminator token in case your textbox is already formatted with linefeed (with optional carriage returns). Here is a JavaScript function that walks the string using String.replace() and an anonymous callback counting tokens as it goes:

// Break up textarea into lines having len chars.
function breakupHTML(text, len) {
    var re = /(<(?:[^'"<>]+|'[^']*'|"[^"]*")*>)|(&(?:\w+|#x[\da-f]+|#\d+);)|(\r?\n)|(.)/ig;
    var count = 0;  // Initialize line char count.
    return text.replace(re,
        function(m0, m1, m2, m3, m4) {
            // Case 1: An HTML tag. Do not add to count.
            if (m1) return m1;
            // Case 2: An HTML entity. Add one to count.
            if (m2) {
                if (++count >= len) {
                    count = 0;
                    m2 += '<br>\n';
                }
                return m2;
            }
            // Case 3: A hard coded line terminator.
            if (m3) {
                count = 0;
                return '<br>\n';
            }
            // Case 4: Any other single character.
            if (m4) {
                if (++count >= len) {
                    count = 0;
                    m4 += '<br>\n';
                }
                return m4;
            } // Never get here.
        });
}

Here's a breakdown of the regex in commented format so you can see what is being captured:

p = re.compile(r"""
    # Match one HTML open/close tag, HTML entity or other char.
      (<(?:[^'"<>]+|'[^']*'|"[^"]*")*>)  # $1: HTML open/close tag
    | (&(?:\w+|\#x[\da-f]+|\#\d+);)      # $2: HTML entity.
    | (\r?\n)                            # $3: Line terminator.
    | (.)                                # $4: Any other character.
    """, re.IGNORECASE | re.VERBOSE)
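To see how the four alternatives divide up a string, the same pattern can be run with `RegExp.prototype.exec` and each match labelled by the group that captured it. This is an illustrative sketch only; the `tokenize` function and the type names (`tag`, `entity`, `newline`, `char`) are not part of the answer's code.

```javascript
// Illustrative only: label each token using the same pattern as breakupHTML.
// The function name and the type labels are invented for this example.
function tokenize(text) {
  var re = /(<(?:[^'"<>]+|'[^']*'|"[^"]*")*>)|(&(?:\w+|#x[\da-f]+|#\d+);)|(\r?\n)|(.)/ig;
  var tokens = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    if (m[1]) tokens.push({ type: 'tag', text: m[1] });          // width 0
    else if (m[2]) tokens.push({ type: 'entity', text: m[2] });  // width 1
    else if (m[3]) tokens.push({ type: 'newline', text: m[3] }); // resets the count
    else tokens.push({ type: 'char', text: m[4] });              // width 1
  }
  return tokens;
}

var tokens = tokenize('<b title="a>b">Hi</b> &copy;\n');
console.log(tokens.map(function (t) { return t.type; }).join(' '));
// tag char char tag char entity newline
```

Note that the quoted-attribute alternatives let the tag branch skip over the `>` inside `title="a>b"`, so the whole opening tag is one zero-width token.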

score 0 · Answer 3 · edited May 23 '17 at 11:55

Not wanting to unleash Cthulhu, I decided (unlike my fellow answers) to instead provide an answer to your problem that does not attempt to parse HTML with regular expressions. Instead, I turned to the awe-inspiring force for good that is jQuery, and used that to parse your HTML on the client side.

A working fiddle: http://jsfiddle.net/CKQ9f/6/

The html:

<div id="wordwrapOriginal">Here is some code that I would like to wrap. Lets pretend this goes on for over 70 spaces.etend this g<b class="foo bar">Helloend this goes on for over 70 spaces.etend</b>Here is some code that I would like to wrap. Lets pretend this goes on for over 70 spaces.etend this g</div>
<hr>
<div id="wordwrapResult"></div>

The jQuery:

// lifted from here: https://stackoverflow.com/a/5259788/808921
$.fn.outerHTML = function() {
    var $t = $(this);
    if( "outerHTML" in $t[0] )
    { return $t[0].outerHTML; }
    else
    {
        var content = $t.wrap('<div></div>').parent().html();
        $t.unwrap();
        return content;
    }
}

// takes plain strings (no markup) and adds <br> to 
// them when each "line" has exceeded the maxLineLen
function breakLines(text, maxLineLen, startOffset)
{
   var returnVals = {'text' : text, finalOffset : startOffset + text.length};
   if (text.length + startOffset > maxLineLen)
   {
      var wrappedWords = "";
      var wordsArr = text.split(' ');
      var lineLen = startOffset;
      for (var i = 0; i < wordsArr.length; i++)
      {
        if (wordsArr[i].length + lineLen > maxLineLen)
        {
          wrappedWords += '<br>';
          lineLen = 0;
        } 
        wrappedWords += (wordsArr[i] + ' ');
        lineLen += (wordsArr[i].length + 1);
      } 
      returnVals['text'] = wrappedWords.replace(/\s$/, '');
      returnVals['finalOffset'] = lineLen;
   }
   return returnVals;
}

// recursive function which will traverse the "tree" of HTML 
// elements under the baseElem, until it finds plain text; at which 
// point, it will use the above function to add newlines to that text
function wrapHTML(baseElem, maxLineLen, startOffset)
{
    var returnString = "";
    var currentOffset = startOffset;

    $(baseElem).contents().each(function () {
        if (! $(this).contents().length) // plain text
        {
            var tmp = breakLines($(this).text(), maxLineLen, currentOffset);
            returnString += tmp['text'];
            currentOffset = tmp['finalOffset'];

        }
        else // markup
        {
            var markup = $(this).clone();
            var tmp = wrapHTML(this, maxLineLen, currentOffset);
            markup.html(tmp['html']);
            returnString += $(markup).outerHTML();
            currentOffset = tmp['finalOffset'];
        }
    });

    return {'html': returnString, 'finalOffset': currentOffset};
}


$(function () {

   var wrappedHTML = wrapHTML("#wordwrapOriginal", 70, 0);

   $("#wordwrapResult").html(wrappedHTML['html']);

});

Note the recursion - can't do that with a regex!
