JavaScript regex to match the last space that is not inside an HTML tag definition

Question

I'm trying to figure out a JavaScript regex that will match the last space that is not inside an HTML tag. For example:

// Should match the space between `custom` and `text`
My custom text;

// Should match the space between `a` and `link`
My custom text with <a href="#">a link</a>.

// Should still match the space between `a` and `link`
My custom text with <a href="#">a link</a><span style="color: red;">.</span>

I have the following regular expression (source, modified) that selects all spaces not in HTML tags: `(?<!<[^>]*)\s(?<![^>]*<)`, but I'm not sure how to take it one step further and select only the last of those spaces.

At first I thought I could do this: `(?<!<[^>]*)\s(?<![^>]*<)(?=[^\s]*$)`, but that doesn't work with my last example.

Here's a fiddle.

Any ideas?

In case you were hoping for this to be reliable: you can’t use regex to determine whether a space is in an HTML tag. `(?<!<[^>]*)\s(?<![^>]*<)` has a lot of edge cases. If you want something reliable, use an HTML parser. If not, and you’d like to carry on with this regex: run it in a loop with `exec`, storing the previous match in a variable, and use the stored value when `exec` returns `null`. That’s the last match. (Also… JavaScript regex? You’re okay with the browser support of lookbehinds?) — Ry-, Feb 23 '18 at 02:24
@Ryan Thanks for the info. I guess I didn't realize that this was a tricky thing for regex. Maybe I'll consider another approach. (But hey, "you should consider a different approach altogether" is as useful an answer as any!) — Pete, Feb 23 '18 at 18:04
@KoshVery It's slightly hacky, but basically my client really wants to avoid typographical widows. The typical approach is to add an `&nbsp;` between the last two words. I'd like to do that without breaking tags. (As a side note, I'm doing this on the admin side, prior to saving, so that I can avoid the computation and the flash-before-nbsp-is-inserted that would appear if I just did it on pages when they loaded.) — Pete, Feb 23 '18 at 18:13
You'd better go the DOM way: get the last text node within whatever element(s) you need to apply this to, and replace the last space in that with a non-breaking one. If that replacement returns the same text content as before (so there was no space in this text node), move on to the second-to-last text node, and so on. https://stackoverflow.com/a/7078792/1427878 shows a way to get all text nodes using XPath and PHP DOM; https://stackoverflow.com/a/2579869/1427878 has several ways to do the same in JS. — CBroe, Feb 23 '18 at 19:25
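
The loop suggested in the first comment (run the regex with `exec`, remember the previous match, and use it once `exec` returns `null`) could be sketched roughly as follows. The helper name `replaceLastOutsideTagSpace` and its parameters are made up for illustration; the pattern is the one from the question, so the sketch needs an engine with lookbehind support and inherits that pattern's edge cases.

```javascript
// Hypothetical helper, not from the original post: replaces the last
// whitespace character that the question's pattern considers "outside a tag".
function replaceLastOutsideTagSpace(html, replacement) {
  // Pattern taken from the question; the g flag makes exec() advance
  // through the string via lastIndex on each call.
  var re = /(?<!<[^>]*)\s(?<![^>]*<)/g;
  var last = null;
  var match;

  // Keep overwriting `last` until exec() runs out of matches.
  while ((match = re.exec(html)) !== null) {
    last = match;
  }

  // No qualifying space at all: return the input unchanged.
  if (last === null) {
    return html;
  }

  return html.slice(0, last.index) +
         replacement +
         html.slice(last.index + last[0].length);
}

console.log(replaceLastOutsideTagSpace('My custom text.', '&nbsp;'));
// My custom&nbsp;text.
console.log(replaceLastOutsideTagSpace(
  'My custom text with <a href="#">a link</a><span style="color: red;">.</span>',
  '&nbsp;'
));
// My custom text with <a href="#">a&nbsp;link</a><span style="color: red;">.</span>
```

As the same comment warns, this is only as reliable as the pattern itself; for arbitrary markup, an HTML parser or the DOM is the safer route.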

score 1 · Accepted Answer · answered Feb 23 '18 at 19:16

You need `\s+((\S|<[^>]+>)*)$`, which matches one or more whitespace characters followed by zero or more non-whitespace characters or HTML tags, up to the end of the string.

Look at the snippet below:

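var txt1 = 'My custom text.',
    txt2 = 'My custom text with <a href="#">a link</a>',
    txt3 = 'My custom text with <a href="#">a link</a><span style="color: red;">.</span>';

// A regex literal is enough here; the pattern is anchored with $,
// so it can match at most once and the g flag is not needed.
var reg = /\s+((\S|<[^>]+>)*)$/;

// $1 re-inserts everything that followed the matched whitespace.
console.log(txt1.replace(reg, "&nbsp;$1")); // My custom&nbsp;text.
console.log(txt2.replace(reg, "&nbsp;$1")); // My custom text with <a href="#">a&nbsp;link</a>
console.log(txt3.replace(reg, "&nbsp;$1")); // My custom text with <a href="#">a&nbsp;link</a><span style="color: red;">.</span>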
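
To make the pattern easier to follow, here is the same replacement wrapped in a small function with each part of the regex annotated. The name `preventWidow` is hypothetical and only used for illustration. This is a sketch that assumes the string contains at least one space outside of any tag; it is not a general HTML-aware solution.

```javascript
// Hypothetical wrapper around the pattern from this answer.
function preventWidow(html) {
  // \s+             the whitespace run to replace
  // (               group 1: everything after that whitespace
  //   (\S|<[^>]+>)*   non-whitespace characters, or whole tags
  //                   (tags are allowed to contain spaces, e.g. attributes)
  // )
  // $               must reach the end of the string
  var pattern = /\s+((\S|<[^>]+>)*)$/;
  return html.replace(pattern, '&nbsp;$1');
}

console.log(preventWidow('My custom text.'));
// My custom&nbsp;text.
console.log(preventWidow(
  'My custom text with <a href="#">a link</a><span style="color: red;">.</span>'
));
// My custom text with <a href="#">a&nbsp;link</a><span style="color: red;">.</span>
```

Because everything after the matched whitespace has to be either a non-space character or a complete tag, the spaces inside `<span style="color: red;">` are consumed by the tag alternative, and the match lands on the space between `a` and `link`.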

This definitely seems like a simple and complete solution for what I need. Thanks! — Pete, Feb 23 '18 at 21:51
