JavaScript validation issue with international characters

Question

We use the excellent validator plugin for jQuery here on Stack Overflow to do client-side validation of input before it is submitted to the server.

It generally works well; however, this one has us scratching our heads.

The following validator method is used on the ask/answer form for the user name field (note that you must be logged out to see this field on the live site; it's on every /question page and the /ask page):

$.validator.addMethod("validUserName",
  function(value, element) {
    return this.optional(element) ||
      /^[\w\-\s\dÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöüçÇßØøÅåÆæÞþÐð]+$/.test(value);
  },
  "Can only contain A-Z, 0-9, spaces, and hyphens.");

Now this regex looks weird but it's pretty simple:

match the beginning of the string (^)
match any of these..
- word character (\w)
- dash (-)
- space (\s)
- digit (\d)
- crazy moon language characters (àèìòù etc)
now match the end of the string ($)

Yes, we ran into the Internationalized Regular Expressions problem. JavaScript's definition of "word character" does not include international characters.. at all.
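
A quick check of that claim, as a minimal sketch (the sample names are only illustrative): \w is equivalent to [A-Za-z0-9_], so any accented letter fails it. The last pattern assumes an engine with ES2018 Unicode property escapes, which did not exist when this question was asked.

```javascript
// \w is defined as [A-Za-z0-9_]; accented letters are not included.
// Non-ASCII characters are written as \u escapes so this snippet stays pure ASCII.
var wordOnly = /^\w+$/;

var asciiName = "Bill";
var accentedName = "J\u00F6rn"; // "Jörn"

var asciiResult = wordOnly.test(asciiName);       // true
var accentedResult = wordOnly.test(accentedName); // false: "ö" is not in \w

// Assumption: a modern engine (ES2018+). \p{L} matches a letter from any script.
var anyLetter = /^\p{L}+$/u;
var anyLetterResult = anyLetter.test(accentedName); // true

console.log(asciiResult, accentedResult, anyLetterResult);
```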

Here's the weird part: even though we've gone to the trouble of manually adding tons of the valid international characters to the regex, it doesn't work. You cannot enter these international characters in the input box for user name without getting the..

Can only contain A-Z, 0-9, spaces, and hyphens

.. validation return!

Obviously the validation is working for the other parts of the regex.. so.. what gives?

The other strange part is that this validation works in the browser's JavaScript console but not when executed as a part of our standard *.js includes.

/^[\w-\sÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöüçÇßØøÅåÆæÞþÐð]+$/ .test('ÓBill de hÓra') === true

We've run into some really bizarre international character issues in JavaScript code before, resulting in some very, very nasty hacks. We'd like to understand what's going on here and why. Please enlighten us!

Could this be a character encoding problem? I.e., a crazy moon "Ä" coming from the user is not an "Ä" in your regex? — balpha, Jul 02 '09 at 09:42
I don't know the answer but that's a good way to write up a question. — Onorio Catenacci, Jul 02 '09 at 09:42
@Onorio Jeff always advocates asking well-written questions, so he better be doing that himself, too :-) But you're certainly right. — balpha, Jul 02 '09 at 09:45
é is not a character from a moon language; "pokémon" is written with the English alphabet, is it not? Also check my comment on Jörn's answer. — Hoffmann, Nov 23 '12 at 17:50

Jörn Zaefferer · Accepted Answer · 2017-09-15T13:49:23.437

I think the email and url validation methods are a good reference here, eg. the email method:

email: function(value, element) {
    return this.optional(element) || /^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i.test(value);
},

The script to compile that regex.

In other words, replacing your arbitrary list of "crazy moon" characters with this could help:

[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]

Basically this avoids the character encoding issues you have elsewhere by replacing the needs-encoding characters with more general definitions. While not necessarily more readable, so far it's shorter than your full list.
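
As a rough sketch of that suggestion (not the code that actually went live; isValidUserName is a hypothetical helper name), the user-name pattern from the question could look like this with the explicit list swapped for the escaped ranges:

```javascript
// The explicit character list is replaced by the escaped ranges from the
// email method. The source is pure ASCII, so the file's encoding stops mattering.
var validUserNamePattern = /^[\w\-\s\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]+$/;

// Hypothetical helper standing in for the body of the validator method.
function isValidUserName(value) {
  return validUserNamePattern.test(value);
}

var accentedOk = isValidUserName("\u00D3Bill de h\u00D3ra"); // true ("ÓBill de hÓra")
var markupRejected = isValidUserName("Bill <script>");       // false: "<" and ">" are not allowed

console.log(accentedOk, markupRejected);
```

Inside the validator method this would be used as this.optional(element) || isValidUserName(value). Keep in mind that these ranges are far broader than the original list: they accept almost every character in the Basic Multilingual Plane from U+00A0 upward, including symbols such as "×".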

Just to clarify why this worked: if your .js file is saved in one character encoding, every character inside its regex literals is represented in that encoding, even if your web page uses another. In my projects I simply encode EVERYTHING that can contain international strings in UTF-8, including .js files. What probably happened to Jeff is that his .js files were saved in one charset while his page was parsed in another, with his HTTP requests/responses probably using the same charset as the page. This explains why it worked in the debugger. — Hoffmann, Nov 23 '12 at 17:57
Another thing: try alert("áéíóú"). If it displays correctly, your JavaScript file is in the same encoding as your page. Yet another solution is to include your scripts with a script tag whose charset attribute is set to ISOsomething, where ISOsomething is the encoding of your .js file. This is a common error because most IDEs create .js files in their default encoding, which is almost never UTF-8. — Hoffmann, Nov 23 '12 at 18:01
this helped me, supports i18n chars and NO double quotes: `^[a-zA-Z0-9!@#$%^~&*/?:'`\,\\|{}()-_+\s\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]*$` — STEEL, Mar 23 '17 at 04:50
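
To make the encoding explanation in the comments above concrete, here is a small simulation. It is only a sketch of one possible mismatch (a UTF-8 file read as ISO-8859-1); which direction the mismatch went on the live site is not established in this thread.

```javascript
// "Ä" (U+00C4) as it would be stored in a UTF-8 encoded .js file: bytes C3 84.
var utf8Bytes = new TextEncoder().encode("\u00C4");

// Reading those bytes as ISO-8859-1 maps every byte to one character,
// so the single "Ä" turns into two characters: U+00C3 and U+0084.
var misdecoded = String.fromCharCode.apply(null, Array.from(utf8Bytes));

// What the regex engine would see if the file were read with the wrong charset.
var brokenPattern = new RegExp("^[" + misdecoded + "]+$");
// What the author intended.
var intendedPattern = /^[\u00C4]+$/;

var userInput = "\u00C4"; // the "Ä" the user actually typed
var brokenResult = brokenPattern.test(userInput);     // false
var intendedResult = intendedPattern.test(userInput); // true

console.log(misdecoded.length, brokenResult, intendedResult);
```

Typing the regex into the console sidesteps the file entirely, which fits the observation that the same pattern works there.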

score 14 · Answer 2 · answered Jul 02 '09 at 10:07

This isn't really an answer, but I don't have 50 rep yet to add a comment... It can definitely be attributed to encoding issues.

Yeah, "ECMA shouldn't care about encoding..." blah blah. Well, if you're on Firefox, go to View > Character Encoding > Western (ISO-8859-1), then try using the Name field.

It works fine for me after changing the encoding manually (granted the rest of the page doesn't like the encoding switch, :P)

(on IE8 you can go to Page > Encoding > Western European (Windows) to get the same effect)

he's right, this magically makes the Name: validation work (!) — Jeff Atwood, Jul 02 '09 at 10:22

Boldewyn · Answer 3 · score 3 · 2009-07-02T10:58:32.583

What is the character encoding of the JS file?

For XML QNames I use this RegExp:

/**
 * Definition of an XML Name
 */
var NameStartChar = "A-Za-z:_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"+
                    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"+
                    "\uF900-\uFDCF\uFDF0-\uFFFD\u010000-\u0EFFFF";
var NameChar = NameStartChar+"\\-\\.0-9\u00B7\u0300-\u036F\u203F-\u2040";
var Name = "^["+NameStartChar+"]["+NameChar+"]*$";
RegExp (Name).test (value);

It works like a charm, including with internationalized characters. Note the escaping: because of it, I can restrict the JS file to ASCII characters only, so I don't get into trouble with ISO-8859 vs. UTF-8 charsets.

This no longer holds if you use a character encoding of which ASCII is not a true subset (e.g., UTF-16, which is common in Asia).
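
If you already have a list of literal characters and want the ASCII-only form described above, a small conversion helper can do the rewriting. This is a sketch; toUnicodeEscapes is a made-up name, not part of any library.

```javascript
// Hypothetical helper: rewrite every non-ASCII UTF-16 code unit as \uXXXX,
// so the result can be pasted into a pure-ASCII source file.
function toUnicodeEscapes(text) {
  return text.replace(/[^\x00-\x7F]/g, function (ch) {
    var hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return "\\u" + ("0000" + hex).slice(-4);
  });
}

// "À", then the range "È" to "þ", written with escapes to keep this snippet ASCII.
var escaped = toUnicodeEscapes("\u00C0\u00C8-\u00FE");
var pattern = new RegExp("^[" + escaped + "]+$");

console.log(escaped);                    // \u00C0\u00C8-\u00FE
console.log(pattern.test("\u00C9"));     // true: "É" lies inside the È-þ range
```

Characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which is still valid in a JavaScript string or regex source.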

Cheers,

edited Jul 02 '09 at 10:58

answered Jul 02 '09 at 09:47

Boldewyn


As I understood, the validator rules are in an external JS file. Then I bet on that file being in the wrong encoding (i.e., not UTF-8). – Boldewyn Jul 02 '09 at 09:57
I am opening the file on disk in Notepad2 and it looks correct -- identical to what you see above in ANSI and when I switch to Unicode, UTF-8 encodings, also identical. – Jeff Atwood Jul 02 '09 at 10:19
That can't be. An ANSI 'Ä' (==ISO-8859-1) has a single-byte representation 'C4', while UTF-8 'Ä' looks in a hex editor like 'C3 84'. What do you mean with 'switch'? Is it real conversion between encodings? – Boldewyn Jul 02 '09 at 10:51
well, I'm opening the .js file from the server itself in Notepad2 and switching file encodings via the drop-down menu. I can't see any differences in any of them for the regex string. It is entirely possible I'm doing something wrong.. – Jeff Atwood Jul 02 '09 at 11:29
weirdly, this matches true on a string containing a "<". Seemingly because of the last bit of the NameStartChar "\u010000-\u0EFFFF", even though < is \u003C and not in that range. Similarly @, ?, =, and other characters between '9' and 'A'. thoughts on why? – jwl Sep 03 '10 at 21:26
@larson4: Hm, it can be that your JS engine cuts off after the first 4 digits. But then, `\u0100` still doesn't contain the `<`. Strange, indeed. – Boldewyn Sep 06 '10 at 07:35
I have created a javascript library to do some of this stuff, not sure how correct or optimal it is, but take a look: http://code.google.com/p/charfunk/ – jwl Sep 08 '10 at 22:50
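
A likely explanation for the "<" match reported in the comments above: a \u escape in a JavaScript string takes exactly four hex digits, so "\u010000" is U+0100 followed by the two characters "00", and the class ends up containing an accidental range from "0" to U+0EFF. The sketch below checks this; the fix at the end assumes an ES2015+ engine, which was not available in 2009.

```javascript
// "\u010000" is U+0100 followed by "0" and "0": three characters, not one.
var parsed = "\u010000";
var parsedLength = parsed.length; // 3

// The tail of NameStartChar exactly as written in the answer.
var buggyTail = "\uFDF0-\uFFFD\u010000-\u0EFFFF";
var buggy = new RegExp("^[" + buggyTail + "]+$");
var buggyMatchesLt = buggy.test("<"); // true: "<" (U+003C) lies between "0" and U+0EFF

// One possible fix: \u{...} code point escapes together with the "u" flag.
var fixed = new RegExp("^[\\uFDF0-\\uFFFD\\u{10000}-\\u{EFFFF}]+$", "u");
var fixedMatchesLt = fixed.test("<");                // false
var fixedMatchesAstral = fixed.test("\uD800\uDC00"); // true: U+10000 as a surrogate pair

console.log(parsedLength, buggyMatchesLt, fixedMatchesLt, fixedMatchesAstral);
```

On an engine without the "u" flag, the simplest repair is to drop the "\u010000-\u0EFFFF" part altogether.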

score 3 · Answer 4 · edited Jun 23 '20 at 13:57

Late to the game here, but I just used this expression and it seemed to work well for me. Seems to be fairly comprehensive and relatively simple:

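// No "g" flag: with it, test() keeps state in lastIndex between calls,
// and the third test below would wrongly report false.
// A-Za-z instead of A-z, which would also admit [ \ ] ^ _ and `.
var re = /^[A-Za-zÀ-Ÿ\s\d-]*$/;
var str1 = 'casa-me,pois 99 estou farto! Eis a lista:uma;duas;três';
var str2 = 'casa-me pois 99 estou farto Eis a lista uma duas três';
var str3 = 'àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ';

alert(re.test(str1)); // false: contains ",", "!", ":" and ";"
alert(re.test(str2)); // true
alert(re.test(str3)); // true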
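
A general caution when one regex object is reused for several test() calls: the g flag makes test() stateful, because each successful match moves lastIndex forward. A minimal sketch with made-up inputs:

```javascript
// With the "g" flag, a successful test() leaves lastIndex at the end of the match.
var globalRe = /^[a-z]*$/g;
var firstCall = globalRe.test("abc");     // true, lastIndex is now 3
var secondCall = globalRe.test("abcdef"); // false: the search resumes at index 3, where ^ cannot match

// Without the flag, every call starts from the beginning of the string.
var plainRe = /^[a-z]*$/;
var plainFirst = plainRe.test("abc");     // true
var plainSecond = plainRe.test("abcdef"); // true

console.log(firstCall, secondCall, plainFirst, plainSecond);
```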

score 2 · Answer 5 · answered Jul 02 '09 at 09:41

The international characters listed are part of extended ASCII. The ones added by you are certainly not.

dusoft


score 2 · Answer 6 · answered Jul 02 '09 at 09:46

Seeing as the statement works in the console, could this have to do with the way your .js files are saved (i.e., ASCII or UTF-8), with the browser loading them accordingly and translating the characters in the process?

Colin


JS doesn't know anything about UTF-8, even if the encoding is set so. – dusoft Jul 02 '09 at 09:47
But the browser does, doesn't it? What if the file is loaded as UTF-8 and the JS engine of the browser interprets the characters wrongly because the browser loaded the file incorrectly ? – Colin Jul 02 '09 at 09:51
Yep, the browser cares. If you save an "Ä" as not-Unicode, it will result in an invalid UTF-8 byte stream. Therefore, it can never match a UTF-8 byte stream corresponding to "Ä". – Boldewyn Jul 02 '09 at 09:53
s/browser cares/browser and hence the JS engine cares/ – Boldewyn Jul 02 '09 at 09:55
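
The "invalid UTF-8 byte stream" point in the comments above can be checked directly. This is a sketch assuming an environment with the standard TextDecoder API (current browsers, recent Node.js); isValidUtf8 is a hypothetical helper name.

```javascript
// Hypothetical helper: true if the bytes form a valid UTF-8 sequence.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws on malformed input instead of substituting U+FFFD.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

var latin1Saved = new Uint8Array([0xC4]);     // "Ä" saved as ISO-8859-1
var utf8Saved = new Uint8Array([0xC3, 0x84]); // "Ä" saved as UTF-8

var latin1IsValid = isValidUtf8(latin1Saved); // false: C4 is a lead byte with no continuation
var utf8IsValid = isValidUtf8(utf8Saved);     // true

// A lenient decoder replaces the bad byte with U+FFFD, which can never match "Ä".
var lenient = new TextDecoder("utf-8").decode(latin1Saved);
var lenientMatches = /^[\u00C4]+$/.test(lenient); // false

console.log(latin1IsValid, utf8IsValid, lenientMatches);
```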

score 2 · Answer 7 · answered Jul 02 '09 at 10:09

Use something like Fiddler or Charles (not Firebug's Net panel, or anything else that's actually inside the browser) to examine what's actually coming over the wire. It's almost certainly an encoding issue: either the file has been saved in some Microsoft character set and is being sent as UTF-8, or maybe the other way around.

In the case of JS RegExps you can, as Boldewyn points out, avoid these problems by specifying the Unicode code point for the characters you want that are outside the US-ASCII range. It would still be as well to make sure you aren't mixing up encodings between the place where the file is saved and the place where it's served, though.

Both Fiddler and Charles can deal with that. IIRC Fiddler (at least in version 2) will offer you a button in the Response viewing area to allow you to view the ungzipped content. — NickFitz, Jul 02 '09 at 16:46
