23

We use the excellent validator plugin for jQuery here on Stack Overflow to do client-side validation of input before it is submitted to the server.

It generally works well; however, this one has us scratching our heads.

The following validator method is used on the ask/answer form for the user name field (note that you must be logged out to see this field on the live site; it's on every /question page and the /ask page):

$.validator.addMethod("validUserName",
  function(value, element) {
    return this.optional(element) ||
      /^[\w\-\s\dÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöüçÇßØøÅåÆæÞþÐð]+$/.test(value);
  },
  "Can only contain A-Z, 0-9, spaces, and hyphens.");

Now this regex looks weird but it's pretty simple:

  • match the beginning of the string (^)
  • match one or more (+) of any of these:
    • word character (\w)
    • dash (-)
    • whitespace (\s)
    • digit (\d, already covered by \w)
    • crazy moon language characters (àèìòù etc)
  • now match the end of the string ($)

Yes, we ran into the Internationalized Regular Expressions problem. JavaScript's definition of "word character" does not include international characters.. at all.
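
A quick way to see the limitation described above: in JavaScript, \w is defined as exactly [A-Za-z0-9_], so an accented letter fails it unless the class lists that letter explicitly. The snippet below is only an illustrative sketch; the names `asciiWord` and `withAccent` are made up for this example, and the non-ASCII letter is written as a \u escape so the result does not depend on how the file is saved.

```javascript
// \w in JavaScript covers only ASCII letters, digits, and underscore.
const asciiWord = /^\w+$/;
// Same class with one accented letter (U+00E9, "e" with acute) added explicitly.
const withAccent = /^[\w\u00E9]+$/;

const plain = "pokemon";
const accented = "pok\u00E9mon";

console.log(asciiWord.test(plain));     // true
console.log(asciiWord.test(accented));  // false: \w does not cover U+00E9
console.log(withAccent.test(accented)); // true
```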

Here's the weird part: even though we've gone to the trouble of manually adding tons of the valid international characters to the regex, it doesn't work. You cannot enter these international characters in the input box for user name without getting the..

Can only contain A-Z, 0-9, spaces, and hyphens

.. validation return!

Obviously the validation is working for the other parts of the regex.. so.. what gives?

The other strange part is that this validation works in the browser's JavaScript console but not when executed as a part of our standard *.js includes.

/^[\w-\sÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöüçÇßØøÅåÆæÞþÐð]+$/.test('ÓBill de hÓra') === true

We've run into some really bizarre international character issues in JavaScript code before, resulting in some very, very nasty hacks. We'd like to understand what's going on here and why. Please enlighten us!

alex
Jeff Atwood
  • Could this be a character encoding problem? I.e., a crazy moon "Ä" coming from the user is not an "Ä" in your regex? – balpha Jul 02 '09 at 09:42
  • I don't know the answer but that's a good way to write up a question. – Onorio Catenacci Jul 02 '09 at 09:42
  • @Onorio Jeff always advocates asking well-written questions, so he better be doing that himself, too :-) But you're certainly right. – balpha Jul 02 '09 at 09:45
  • é is not a character from a moon language, pokémon is in the english alphabet is it not? Also check my comment Jorn answer – Hoffmann Nov 23 '12 at 17:50

7 Answers

36

I think the email and url validation methods are a good reference here, e.g. the email method:

email: function(value, element) {
    return this.optional(element) || /^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i.test(value);
},

The script to compile that regex.

In other words, replacing your arbitrary list of "crazy moon" characters with this could help:

[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]

Basically this avoids the character encoding issues you have elsewhere by replacing the needs-encoding characters with more general definitions. While not necessarily more readable, so far it's shorter than your full list.
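
As a rough sketch of that suggestion, the user-name check could be written with the escaped ranges in place of the literal characters. `isValidUserName` is a hypothetical standalone helper introduced here so the regex can be tried without jQuery; inside the plugin, the same expression would go where the original `validUserName` regex is.

```javascript
// ASCII-only source: the non-ASCII ranges are written as \u escapes, so the
// pattern is the same however the .js file is saved or served.
const userNamePattern = /^[\w\-\s\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]+$/;

// Hypothetical helper (not part of the validation plugin).
function isValidUserName(value) {
  return userNamePattern.test(value);
}

console.log(isValidUserName("\u00D3Bill de h\u00D3ra")); // true
console.log(isValidUserName("Bill <script>"));           // false: < and > are not in the class
```

Note that these ranges are much broader than the hand-picked list: they admit nearly every character from U+00A0 up to U+D7FF, including symbols such as © and ×, so whether that is acceptable for user names is a separate decision.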

Jörn Zaefferer
  • Just to clarify why this worked: if your .js file is encoded in a given character encoding, all characters inside regular expressions in it are represented in that encoding, even if your webpage uses another one. In my projects I simply encode EVERYTHING that can contain international strings in UTF-8. This includes .js files. What probably happened to Jeff was that his .js files were encoded in one charset while his page was parsed with another, with his HTTP requests/responses probably encoded in the same charset as the page. This explains why it worked in the debugger. – Hoffmann Nov 23 '12 at 17:57
  • Another thing: try alert("áéíóú"). If it displays correctly, your JavaScript file is encoded in the same encoding as your page. Yet another solution is to simply include your scripts with a charset attribute on the script tag, e.g. <script charset="ISOsomething">, where ISOsomething is the encoding of your .js file. This is a common error because most IDEs create .js files in their default encoding, which is almost never UTF-8. – Hoffmann Nov 23 '12 at 18:01
  • Both links in the answer are broken. – Mottie Feb 03 '17 at 13:45
  • this helped me, supports i18n chars and NO double quotes: `^[a-zA-Z0-9!@#$%^~&*/?:'`\,\\|{}()-_+\s\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]*$` – STEEL Mar 23 '17 at 04:50
14

This isn't really an answer but I don't have 50 rep yet to add a comment... It can definitely be attributed to encoding issues.

Yea "ECMA shouldn't care about encoding..." blah blah, well if you're on Firefox, go to View > Character Encoding > Western (ISO-8859-1) then try using the Name field.

It works fine for me after changing the encoding manually (granted the rest of the page doesn't like the encoding switch, :P)

(on IE8 you can go to Page > Encoding > Western European (Windows) to get the same effect)

scott
3

What is the character encoding of the JS file?

For XML QNames I use this RegExp:

/**
 * Definition of an XML Name
 */
var NameStartChar = "A-Za-z:_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"+
                    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"+
                    "\uF900-\uFDCF\uFDF0-\uFFFD\u010000-\u0EFFFF";
var NameChar = NameStartChar+"\\-\\.0-9\u00B7\u0300-\u036F\u203F-\u2040";
var Name = "^["+NameStartChar+"]["+NameChar+"]*$";
RegExp (Name).test (value);

It works like a charm also with internationalized characters. Note the escaping. Due to that I'm able to restrict the JS file to ASCII characters only. Therefore I don't get into trouble when dealing with ISO-8859 vs UTF-8 charsets.

This no longer holds if you use a character encoding of which ASCII is not a true subset (e.g., UTF-16, as used in Asia).
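
One caveat about the snippet above, offered as a sketch rather than a correction from the author: a JavaScript \uXXXX escape consumes exactly four hex digits, so "\u010000-\u0EFFFF" is read as U+0100, "0", the range "0" to U+0EFF, then "F", "F". That range admits ASCII characters such as <, =, ? and @. Assuming an engine that supports the ES2015 "u" flag, the supplementary range can be written with \u{...} instead. The names `bmpStart`, `nameTail`, `brokenName` and `fixedName` are made up for this example.

```javascript
// Ranges shared by both versions (the BMP part of the XML name start characters).
const bmpStart = "A-Za-z:_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D" +
                 "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF" +
                 "\uF900-\uFDCF\uFDF0-\uFFFD";
const nameTail = "\\-\\.0-9\u00B7\u0300-\u036F\u203F-\u2040";

// As in the answer: "\u010000-\u0EFFFF" is parsed as \u0100, "0", "0"-\u0EFF, "F", "F".
const brokenStart = bmpStart + "\u010000-\u0EFFFF";
const brokenName = new RegExp("^[" + brokenStart + "][" + brokenStart + nameTail + "]*$");

// With the "u" flag, \u{...} takes a full code point, so the range is the intended one.
const fixedStart = bmpStart + "\\u{10000}-\\u{EFFFF}";
const fixedName = new RegExp("^[" + fixedStart + "][" + fixedStart + nameTail + "]*$", "u");

console.log(brokenName.test("<a"));       // true: "<" falls inside "0"-\u0EFF
console.log(fixedName.test("<a"));        // false
console.log(fixedName.test("\u00E9lan")); // true
```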

Cheers,

Boldewyn
  • As I understood, the validator rules are in an external JS file. Then I bet on that file being in the wrong encoding (i.e., not UTF-8). – Boldewyn Jul 02 '09 at 09:57
  • I am opening the file on disk in Notepad2 and it looks correct -- identical to what you see above in ANSI and when I switch to Unicode, UTF-8 encodings, also identical. – Jeff Atwood Jul 02 '09 at 10:19
  • That can't be. An ANSI 'Ä' (==ISO-8859-1) has a single-byte representation 'C4', while UTF-8 'Ä' looks in a hex editor like 'C3 84'. What do you mean with 'switch'? Is it real conversion between encodings? – Boldewyn Jul 02 '09 at 10:51
  • well, I'm opening the .js file from the server itself in Notepad2 and switching file encodings via the drop-down menu. I can't see any differences in any of them for the regex string. It is entirely possible I'm doing something wrong.. – Jeff Atwood Jul 02 '09 at 11:29
  • weirdly, this matches true on a string containing a "<". Seemingly because of the last bit of the NameStartChar "\u010000-\u0EFFFF", even though < is \u003C and not in that range. Similarly @, ?, =, and other characters between '9' and 'A'. thoughts on why? – jwl Sep 03 '10 at 21:26
  • @larson4: Hm, it can be that your JS engine cuts off after the first 4 digits. But then, `\u0100` still doesn't contain the `<`. Strange, indeed. – Boldewyn Sep 06 '10 at 07:35
  • I have created a javascript library to do some of this stuff, not sure how correct or optimal it is, but take a look: http://code.google.com/p/charfunk/ – jwl Sep 08 '10 at 22:50
3

Late to the game here, but I just used this expression and it seemed to work well for me. Seems to be fairly comprehensive and relatively simple:

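// No "g" flag: with it, test() resumes from lastIndex and gives wrong results on repeated calls.
// A-Za-z instead of A-z, which would also admit [ \ ] ^ _ and `.
var re = /^[A-Za-zÀ-Ÿ\s\d-]*$/;
var str1 = 'casa-me,pois 99 estou farto! Eis a lista:uma;duas;três';
var str2 = 'casa-me pois 99 estou farto Eis a lista uma duas três';
var str3 = 'àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ';

alert(re.test(str1)); // false: contains , ! : ;
alert(re.test(str2)); // true
alert(re.test(str3)); // true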
display name
Colin
2

The international characters listed are part of extended ASCII. The ones added by you are certainly not.

dusoft
2

Seeing as the statement works in the console, could this have to do with the way your .js files are saved (i.e. ASCII or UTF-8), and that the browser is loading them thusly and in the process translates the characters?

Colin
  • JS doesn't know anything about UTF-8, even if the encoding is set so. – dusoft Jul 02 '09 at 09:47
  • But the browser does, doesn't it? What if the file is loaded as UTF-8 and the JS engine of the browser interprets the characters wrongly because the browser loaded the file incorrectly ? – Colin Jul 02 '09 at 09:51
  • Yep, the browser cares. If you save an "Ä" as not-Unicode, it will result in an invalid UTF-8 byte stream. Therefore, it never can match an UTF-8 byte stream corresponding to "Ä". – Boldewyn Jul 02 '09 at 09:53
  • s/browser cares/browser and hence the JS engine cares/ – Boldewyn Jul 02 '09 at 09:55
2

Use something like Fiddler or Charles (not Firebug's Net panel, or anything else that's actually inside the browser) to examine what's actually coming over the wire. It's almost certainly an encoding issue: either the file has been saved in some Microsoft character set and is being sent as UTF-8, or maybe the other way around.
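
To make the two failure modes concrete, here is a small simulation, assuming a runtime with the standard TextEncoder/TextDecoder globals (modern browsers, recent Node.js). It mimics what happens to a literal "Ä" in a regex when the file's bytes are decoded with the wrong charset. The variable names are illustrative only.

```javascript
const aUmlaut = "\u00C4"; // capital A with diaeresis, what the user actually types

// Case 1: file saved as UTF-8 (bytes C3 84) but decoded as ISO-8859-1.
// ISO-8859-1 maps each byte straight to the code point of the same value.
const utf8Bytes = new TextEncoder().encode(aUmlaut);
const readAsLatin1 = String.fromCharCode(...utf8Bytes); // "\u00C3\u0084"

// Case 2: file saved as ISO-8859-1 (single byte C4) but decoded as UTF-8.
// A lone C4 is not valid UTF-8, so the decoder substitutes U+FFFD.
const readAsUtf8 = new TextDecoder("utf-8").decode(Uint8Array.of(0xC4)); // "\uFFFD"

// Either way, the character class no longer contains the intended letter.
console.log(new RegExp("^[" + readAsLatin1 + "]+$").test(aUmlaut)); // false
console.log(new RegExp("^[" + readAsUtf8 + "]+$").test(aUmlaut));   // false
console.log(new RegExp("^[" + aUmlaut + "]+$").test(aUmlaut));      // true
```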

In the case of JS RegExps you can, as Boldewyn points out, avoid these problems by specifying the Unicode code point for the characters you want that are outside the US-ASCII range. It would still be as well to make sure you aren't mixing up encodings between the place where the file is saved and the place where it's served, though.
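
For a long hand-written character list, the escapes do not have to be typed manually. The helper below is a hypothetical one-off (the name `escapeNonAscii` is invented for this sketch) that turns every non-ASCII UTF-16 code unit in a regex source string into a \uXXXX escape, so the result can be pasted into an ASCII-only .js file.

```javascript
// Replace each non-ASCII UTF-16 code unit with its \uXXXX escape sequence.
function escapeNonAscii(source) {
  return source.replace(/[^\x00-\x7F]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Input written with escapes here only to keep this example itself ASCII-safe.
const escaped = escapeNonAscii("^[\\w\\-\\s\u00C0\u00E9\u00DF]+$");
console.log(escaped);                                 // ^[\w\-\s\u00C0\u00E9\u00DF]+$
console.log(new RegExp(escaped).test("Stra\u00DFe")); // true
```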

NickFitz
  • gzip over the wire, so awkward to do – Jeff Atwood Jul 02 '09 at 10:36
  • Both Fiddler and Charles can deal with that. IIRC Fiddler (at least in version 2) will offer you a button in the Response viewing area to allow you to view the ungzipped content. – NickFitz Jul 02 '09 at 16:46