I use a regex for my split function.

string.split(/\s/)

But &#8202; (which is a Hair Space) is not recognised. How can I make sure it is (without putting the exact entity code in the regex)?

Nicky Smits

1 Answer

Per MDN, the definition of \s in a regex (in the Firefox browser) is this:

[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]

So if you want to split on something beyond this list (e.g. the text of an HTML entity), you need to add it to your regex yourself. Remember that string.split() is a string function, not an HTML function, so it knows nothing about HTML: to it, an entity is just a sequence of ordinary characters. If you want to split on certain HTML tags or entities, you have to write a regex that includes them.

You can code for it yourself like this:

string.split(/\s|&#8202;/);

Working demo: http://jsfiddle.net/jfriend00/nAQ97/
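
To make the distinction concrete, here is a minimal sketch (the sample strings and variable names are made up for illustration). It assumes an engine whose \s follows the list above, and shows that \s matches the hair space character itself but not the entity text, while the combined regex splits on both:

```javascript
// The character U+200A (hair space) versus the seven-character text "&#8202;".
var hairSpaceChar = "\u200a";
var hairSpaceEntity = "&#8202;";

// In engines that follow the list above, \s matches the character itself...
var matchesChar = /\s/.test(hairSpaceChar);
// ...but not the entity text, which is just "&", "#", digits and ";".
var matchesEntity = /\s/.test(hairSpaceEntity);

// Splitting on whitespace OR the literal entity text handles both forms.
var sample = "one&#8202;two three\u200afour";
var parts = sample.split(/\s|&#8202;/);

console.log(matchesChar, matchesEntity, parts);
```

Note that consecutive separators still produce empty strings in the result, exactly as with a plain split(/\s/).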


If what you really want to do is to have your HTML parsed and converted to text by the browser (which will process all entities and HTML tags), then you can do this:

function getPlainText(str) {
    var x = document.createElement("div");
    x.innerHTML = str;
    return (x.textContent || x.innerText);
}

Then, you could split your string like this:

getPlainText(str).split(/\s/);

Working demo: http://jsfiddle.net/jfriend00/KR2aa/
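
getPlainText needs a DOM, so it will not run where there is no document (Node.js, for example). As a rough alternative, and not the browser's own parser, the sketch below decodes only numeric character references with a regex. The helper name decodeNumericEntities is hypothetical, and named entities such as &nbsp; are deliberately left untouched:

```javascript
// Hypothetical helper: decodes decimal (&#8202;) and hex (&#x200A;) numeric
// character references only. Named entities such as &nbsp; are left as-is.
function decodeNumericEntities(str) {
    return str.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, function (match, code) {
        var isHex = code.charAt(0).toLowerCase() === "x";
        var codePoint = parseInt(isHex ? code.slice(1) : code, isHex ? 16 : 10);
        // Leave out-of-range references untouched instead of throwing.
        return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    });
}

var words = decodeNumericEntities("one&#8202;two&#x200A;three four").split(/\s/);
console.log(words);
```

This assumes String.fromCodePoint is available (ES2015 and later). It does not strip tags or handle named entities, so it is only a stand-in for the browser-based approach when the input is known to contain numeric references.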


If you want to be absolutely sure this works in older browsers, you have two choices. Either test the functions above in every browser you care about, or stop relying on the browser's own definition of \s. For the first option, that means a custom regex listing every entity you want to split on. For the second option, that means a search/replace that turns every Unicode character you want to split on into a regular space before doing the split. Because older browsers weren't very consistent here, there is no free lunch if you want safe compatibility with them.
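
The search/replace idea could look roughly like the sketch below. The names WHITESPACE_CLASS and splitOnAnyWhitespace are hypothetical, and the character list is simply the \s definition quoted earlier written out explicitly, so the result does not depend on what a given engine thinks \s means:

```javascript
// Explicit list of whitespace code points, taken from the \s definition
// quoted earlier, written out so it does not depend on the engine's own \s.
var WHITESPACE_CLASS = /[ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]/g;

// Hypothetical helper: turn every listed character into a regular space,
// trim the ends, then split on runs of spaces.
function splitOnAnyWhitespace(str) {
    var normalized = str.replace(WHITESPACE_CLASS, " ").replace(/^ +| +$/g, "");
    return normalized === "" ? [] : normalized.split(/ +/);
}

console.log(splitOnAnyWhitespace("one\u200atwo\u00a0\u3000three  four "));
```

Unlike a plain split(/\s/), this version collapses runs of whitespace and drops leading and trailing empties. That is a design choice for the sketch; split on a single space instead if you need the empty strings.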

jfriend00
  • What about [this](http://stackoverflow.com/questions/10715801/javascript-decoding-html-entities)? – tenub Mar 20 '14 at 00:16
  • @tenub - that depends upon what the OP really wants. If they want to decode all the HTML in their string and use the automatic conversion to text, then they could use a solution like that. If they want to just split a piece of string based on a specific set of criteria, then they should probably just use a regex for their specific criteria. – jfriend00 Mar 20 '14 at 00:19
  • `\u200a` in js is the same character as `&#8202;` in html. – david Mar 20 '14 at 00:37
  • `The definition of \s in a regex is...`. I think there are minor differences between browsers, do you have a reference for your list? The [list in ES5](http://ecma-international.org/ecma-262/5.1/#sec-15.10) (which is [whitespace](http://ecma-international.org/ecma-262/5.1/#sec-7.2) plus [line terminators](http://ecma-international.org/ecma-262/5.1/#sec-7.3)) is not so extensive (and not representative of what browsers actually do). – RobG Mar 20 '14 at 00:42
  • @RobG - here's [my reference](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) which is what Firefox implements. I added the reference to my answer also. – jfriend00 Mar 20 '14 at 00:44
  • Oh, just noticed on the bottom of the Whitespace Characters list: *Any other Unicode “space separator”*. Cool. – RobG Mar 20 '14 at 00:49
  • @jfriend00—here's a discussion on [comp.lang.javascript](https://groups.google.com/forum/#!searchin/comp.lang.javascript/whitespace/comp.lang.javascript/yjWhmD-z12c/FMi5juF3h5YJ) that should be informative. There is also [stuff from JR Stockton](http://www.merlyn.demon.co.uk/js-valid.htm#Fred) based on what browsers (from a few years ago) actually do. – RobG Mar 20 '14 at 00:54
  • @RobG - yeah it seems kind of wimpy that the ES5 spec doesn't delineate exactly which unicode characters should be construed as whitespace. As they wrote it, they purposely left it up to the browser or regex implementer. – jfriend00 Mar 20 '14 at 01:14