While the combination of the phrases "regular expression" and "parse HTML" usually causes entire universes to crumble, your use case seems simple enough that it could work. Because you want to preserve the HTML formatting after wrapping, it is much easier to just work on a space-delimited sequence of words and leave the markup in place. Here is a very rough approximation of what you'd like to do:
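    var input = "<b>Hello</b> Here is some code that I would like to wrap. Let's pretend this goes on for over 70 spaces. Better ¥€±, let's <em>make</em> it go on for more than 70, and pick üþ a whole <strong>buñ©h</strong> of crazy symbols along the way.";
    var words = input.split(' ');

    // Visible length of each word: strip tags, count each entity as one character.
    // [^>]+ keeps the match inside a single tag; a greedy .+ would swallow
    // everything from the first "<" to the last ">", text included.
    var lengths = [];
    for (var i = 0; i < words.length; i++) {
        lengths.push(words[i].replace(/<[^>]+>/g, '').replace(/&#?\w+;/g, ' ').length);
    }

    var line = [], offset = 0, output = [];
    for (var i = 0; i < words.length; i++) {
        // offset counts visible characters; line.length is the number of
        // joining spaces needed if this word is appended.
        // The line.length > 0 guard stops an over-long word from looping forever.
        if (line.length > 0 && offset + lengths[i] + line.length > 70) {
            output.push(line.join(' '));
            line = [];
            offset = 0;
        }
        line.push(words[i]);
        offset += lengths[i];
    }
    if (line.length > 0) {
        output.push(line.join(' '));
    }
    output = output.join('<br />');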
which results in
Hello Here is some code that I would like to wrap. Let's pretend this
goes on for over 70 spaces. Better ¥€±, let's make it go on for more
than 70, and pick üþ a whole buñ©h of crazy symbols along the way.
Note that the HTML tags (`b`, `em`, `strong`) are preserved; it's just that Markdown doesn't show them.
Basically, the input string is split into words at each space. This is naïve and will break on any tag that itself contains a space (one with attributes, for example), but it's a start. Next, the length of each word is calculated after anything resembling an HTML tag or entity has been removed, with each entity counted as a single character. Then it's a simple matter of iterating over the words while keeping a running tally of the column we're on; once adding the next word would take us past 70, we push the aggregated words onto the output array and start a new line. Again, it's very rough, but it should suffice for most basic HTML.
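If tags with attributes are a concern, one way to soften the naive space split is to tokenize with a regular expression that treats a whole tag as part of the word it is attached to. The following is only a sketch under that assumption: `wrapHtml` and `visibleLength` are hypothetical names made up for this example, and it still assumes reasonably well-formed markup (no stray `<` or `>` in the text).

```javascript
// Hypothetical helper: number of visible characters in a chunk of HTML.
function visibleLength(html) {
  return html
    .replace(/<[^>]*>/g, '')   // drop tags, including ones with attributes
    .replace(/&#?\w+;/g, ' ')  // count each entity as one character
    .length;
}

// Hypothetical helper: wrap `html` so that no line has more than `width`
// visible characters (a single word longer than `width` gets its own line).
function wrapHtml(html, width) {
  // A "word" is a run of complete tags and non-space characters, so
  // <a href="#" title="a b"> is never split at the spaces inside the tag.
  const words = html.match(/(?:<[^>]*>|\S)+/g) || [];
  const lines = [];
  let line = [];
  let used = 0; // visible characters on the current line, excluding spaces

  for (const word of words) {
    const len = visibleLength(word);
    // line.length is the number of joining spaces if this word is appended.
    if (line.length > 0 && used + len + line.length > width) {
      lines.push(line.join(' '));
      line = [];
      used = 0;
    }
    line.push(word);
    used += len;
  }
  if (line.length > 0) lines.push(line.join(' '));
  return lines.join('<br />');
}

// Usage: the link's attributes contain spaces but the tag stays in one piece.
console.log(wrapHtml('<a href="#" title="a b">one</a> two three', 7));
// prints: <a href="#" title="a b">one</a> two<br />three
```

Runs of whitespace are collapsed to single spaces when the words are rejoined, which matches how a browser renders them anyway.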