Understanding whitespace regex in jQuery source

Question

I was trying to understand the whitespace-trimming regex in the jQuery source and came across the following:

rtrim = /^[\s\uFEFF\xA0]+|[\s\uFEFF\xA0]+$/g,

Using a regex tool, I got the following breakdown:

/^[\s\uFEFF\xA0]+|[\s\uFEFF\xA0]+$/g
1st Alternative: ^[\s\uFEFF\xA0]+
^ assert position at start of the string
[\s\uFEFF\xA0]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\s match any white space character [\r\n\t\f ]
\uFEFF matches the character uFEFF literally (case sensitive)
\xA0 matches the character   with position 0xA0 (160 decimal or 240 octal) in the character set
2nd Alternative: [\s\uFEFF\xA0]+$
[\s\uFEFF\xA0]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\s match any white space character [\r\n\t\f ]
\uFEFF matches the character uFEFF literally (case sensitive)
\xA0 matches the character   with position 0xA0 (160 decimal or 240 octal) in the character set
$ assert position at end of the string
g modifier: global. All matches (don't return on first match)

The description above makes the regex easy to read, but when I think about how it is used in practice, a few things still don't make sense:

Why would a string ever contain \uFEFF, and what does it have to do with whitespace? And what on earth is \xA0?

Can anybody explain? It doesn't have to be the most detailed answer; a short one will do.

This might give some info on uFEFF: http://stackoverflow.com/questions/17912307/u-ufeff-in-python-string — lintmouse, Sep 28 '15 at 19:36
You are about to go down the long, unfortunate road of text encoding. `\xA0` is a non-breaking space in the Latin1 charset. jQuery deals with multiple ways you can represent a space depending on the environment. — IanGabes, Sep 28 '15 at 19:40

miken32 · Accepted Answer · 2015-09-29T19:23:29.753

2

0xFEFF is known as ZERO WIDTH NO-BREAK SPACE, and in some browsers it may not be matched by \s alone. The same goes for 0x00A0, NO-BREAK SPACE.
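To make this concrete, here is a minimal sketch that applies the pattern from the question to a made-up string. The input value is hypothetical: it assumes a leading U+FEFF (such as a byte order mark might leave behind) and a trailing U+00A0 (what `&nbsp;` decodes to in HTML).

```javascript
// The pattern quoted in the question.
var rtrim = /^[\s\uFEFF\xA0]+|[\s\uFEFF\xA0]+$/g;

// Hypothetical input: U+FEFF and two spaces in front,
// U+00A0 and a newline at the end.
var raw = "\uFEFF  hello world\xA0\n";

// The g flag lets one replace() call strip both the leading run
// (first alternative) and the trailing run (second alternative).
var trimmed = raw.replace(rtrim, "");

console.log(JSON.stringify(trimmed)); // "hello world"
console.log(raw.length, trimmed.length); // 16 11
```

Because both characters are listed explicitly in the character class, the result does not depend on whether the engine's own \s happens to include them.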

See this document for more detail on what \s matches in ECMA-262 (the standard for JavaScript). According to that spec, jQuery is being overly cautious, since both characters are already included in \s. This is likely done for browser compatibility.
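If you want to see what a particular engine does, a quick check along these lines should work. It is only a sketch: the variable names are my own, and the result depends on the engine you run it in.

```javascript
// Each entry: [label, character]. The labels are only used for printing.
var candidates = [
  ["U+FEFF ZERO WIDTH NO-BREAK SPACE", "\uFEFF"],
  ["U+00A0 NO-BREAK SPACE", "\xA0"]
];

// Test each character on its own against a bare \s.
var results = candidates.map(function (pair) {
  return { name: pair[0], matchedByS: /^\s$/.test(pair[1]) };
});

results.forEach(function (r) {
  console.log(r.name + ": " + (r.matchedByS ? "matched by \\s" : "NOT matched by \\s"));
});
```

An engine that follows ECMAScript 5 or later should report both characters as matched. An older or non-conforming engine may not, which is the case the explicit \uFEFF and \xA0 in rtrim guard against.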

edited Sep 29 '15 at 19:23

answered Sep 28 '15 at 19:41


Please don't refer to the documentation of other languages - in cases of `\s` and `\w` - each flavor may have their own implementation which add or remove some code point from the set. You should refer to [MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp) or ECMA specs for the list of characters. – nhahtdh Sep 29 '15 at 03:56
Good point; I dug into the spec and found the characters listed and have adjusted the link in the answer to point there. – miken32 Sep 29 '15 at 19:25
Btw, `\s` only includes U+FEFF (used as BOM in UTF-16) from ECMAScript 5 (dated 2009). Previously, in ECMAScript 3, U+FEFF was not included (it's in the Cf category, not Zs). I guess Android has been behind the specs for quite some time. – nhahtdh Sep 29 '15 at 19:29
Current spec is 5.1 from June 2011, and includes both 0xFEFF and 0x00A0. Of course jQuery wants to be compatible with everything back to IE 7 or maybe 6. – miken32 Sep 29 '15 at 19:58
What I mean is that Android's implementation of `\s` was behind the specs (and it's rather normal - though they take their time to catch up with the specs). I don't think jQuery is compatible with IE<=8 with that code. See http://stackoverflow.com/questions/21371713/string-match-for-regex-s-on-chinese-string-works-differently-between-ie8-and/21377618#21377618 – nhahtdh Sep 29 '15 at 20:08
