How to approximate Java's Character.isLetterOrDigit() to identify non-English letters and digits in JavaScript?

Question

In JavaScript, is there a way (that survives internationalization) to determine whether a character is a letter or digit? It should correctly identify Ä and ç as letters, and also recognize non-English digits (which I am not going to look up as examples)!

In Java, the Character class has some static methods .isLetter(), .isDigit(), .isLetterOrDigit(), for determining in an internationally suitable way that a char is actually a letter or digit. This is better than code like

// this is not right, but common and easy
if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) {
    // it's a letter
}

because the Character methods also pick up non-English letters. I think C# has similar capabilities...
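For comparison, here is a minimal sketch of the same checks for engines that support ES2018 Unicode property escapes (an assumption; older browsers do not have them). The helper names are hypothetical. `\p{L}` and `\p{Nd}` are used because Java documents `Character.isLetter()` and `Character.isDigit()` in terms of those general categories, so the results should be close, though they can still differ with the Unicode version each runtime ships.

```javascript
// Hypothetical helpers; they assume an engine with ES2018 Unicode property escapes.
const LETTER = /^\p{L}$/u;   // general categories Lu, Ll, Lt, Lm, Lo
const DIGIT = /^\p{Nd}$/u;   // general category Nd (decimal digit)

function isLetter(ch) { return LETTER.test(ch); }
function isDigit(ch) { return DIGIT.test(ch); }
function isLetterOrDigit(ch) { return isLetter(ch) || isDigit(ch); }

// The ASCII-only range check from the question, for comparison.
function isAsciiLetter(ch) {
  return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}

console.log(isAsciiLetter('\u00C4'), isLetter('\u00C4')); // false true  (Ä)
console.log(isLetter('\u00E7'), isDigit('\u0663'));       // true true   (ç, Arabic-Indic digit three)
console.log(isLetterOrDigit('-'));                        // false
```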

Of course, at worst I can send strings back to the server to be checked but that's a pain...

In the end, I want to check whether input is a valid name (starts with a letter, the rest are letters or digits). An outside-the-box possibility for low-volume use might be:

// Returns true if the browser accepts atr as an attribute name.
var validName = function(atr) {
    var ele = document.createElement("div");
    try { ele.setAttribute(atr, "xxx"); }
    catch (e) { return false; }
    return true;
};

This tests out fairly well in IE, FF and Chrome, though thorough testing would be needed to figure out how consistent the answers are across browsers. And again, it is not appropriate for heavy-duty usage because it creates an element on every call.
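The rule stated above (first character a letter, the rest letters or digits) can also be sketched without the DOM. This assumes an engine with ES2018 Unicode property escapes, and `isSimpleName` is a hypothetical name. It is not equivalent to the `setAttribute` test, which follows the browser's own rules for attribute names.

```javascript
// Hypothetical validator: first character is a letter,
// every later character is a letter or a decimal digit.
// Assumes ES2018 Unicode property escapes.
const NAME = /^\p{L}[\p{L}\p{Nd}]*$/u;

function isSimpleName(s) {
  return typeof s === 'string' && NAME.test(s);
}

console.log(isSimpleName('\u00C4rger2')); // true  (Ärger2)
console.log(isSimpleName('2fast'));       // false (starts with a digit)
console.log(isSimpleName('a-b'));         // false (hyphen is neither letter nor digit)
```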

Sadly, the JavaScript regex \w (to match a word character) thinks that Ä is a non-word character, in Chrome and FF at least. — jwl, Sep 03 '10 at 20:41
This related question http://stackoverflow.com/questions/1073412/javascript-validation-issue-with-international-characters seems to indicate there is not really a true solution other than trying to list the characters you are going to pretend are not digits and letters... I hope someone knows better! — jwl, Sep 03 '10 at 20:46
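The `\w` behavior mentioned in the first comment is easy to reproduce. In this short sketch `isWordChar` is a hypothetical helper name; the results follow from `\w` being defined as `[A-Za-z0-9_]`.

```javascript
// \w is defined as [A-Za-z0-9_], so accented letters never match it.
function isWordChar(ch) {
  return /^\w$/.test(ch);
}

console.log(isWordChar('A'));          // true
console.log(isWordChar('_'));          // true: underscore counts as a word character
console.log(isWordChar('\u00C4'));     // false: Ä is outside [A-Za-z0-9_]
console.log(/^\w$/u.test('\u00C4'));   // false: the u flag alone does not widen \w
```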

score 3 · Accepted Answer · edited Dec 18 '14 at 09:28

I have created a small JavaScript utility to provide this functionality. I don't claim it is perfect, so let me know how you fare. If people like it, I'll make this the official answer to this question.

CharFunk: https://github.com/joelarson4/CharFunk

CharFunk.getDirectionality(ch) - Used to find the directionality of the character
CharFunk.isAllLettersOrDigits(string) - Returns true if the string argument is composed of all letters and digits
CharFunk.isDigit(ch) - Returns true if provided a length 1 string that is a digit
CharFunk.isLetter(ch) - Returns true if provided a length 1 string that is a letter
CharFunk.isLetterNumber(ch) - Returns true if provided a length 1 string that is in the Unicode "Nl" category
CharFunk.isLetterOrDigit(ch) - Returns true if provided a length 1 string that is a letter or a digit
CharFunk.isLowerCase(ch) - Returns true if provided a length 1 string that is lowercase
CharFunk.isMirrored(ch) - Returns true if provided a length 1 string that is a mirrored character
CharFunk.isUpperCase(ch) - Returns true if provided a length 1 string that is uppercase
CharFunk.isValidFirstForName(ch) - Returns true if provided a length 1 string that is a valid leading character for a JavaScript identifier
CharFunk.isValidMidForName(ch) - Returns true if provided a length 1 string that is a valid non-leading character for an ECMAScript identifier
CharFunk.isValidName(string,checkReserved) - Returns true if the string is a valid ECMAScript identifier
CharFunk.isWhitespace(ch) - Returns true if provided a length 1 string that is a whitespace character
CharFunk.indexOf(string,callback) - Returns the first index at which the callback returns true
CharFunk.lastIndexOf(string,callback) - Returns the last index at which the callback returns true
CharFunk.matchesAll(string,callback) - Returns true if the callback returns true for every character in the provided string
CharFunk.replaceMatches(string,callback,ch) - Returns a new string with all matched characters replaced
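The callback-based functions at the end of the list take a string and a per-character predicate. A rough sketch of that calling pattern follows. It is not CharFunk's source: the function bodies are assumptions based only on the descriptions above, the names `matchesAll` and `indexOfMatch` are stand-ins, and the example predicate assumes ES2018 Unicode property escapes.

```javascript
// Stand-in helpers, NOT CharFunk itself: behavior is inferred from the descriptions above.

// True if the predicate returns true for every length-1 string in str.
function matchesAll(str, callback) {
  for (let i = 0; i < str.length; i++) {
    if (!callback(str.charAt(i))) return false;
  }
  return true;
}

// Index of the first character for which the predicate returns true, or -1.
function indexOfMatch(str, callback) {
  for (let i = 0; i < str.length; i++) {
    if (callback(str.charAt(i))) return i;
  }
  return -1;
}

// Example predicate; assumes ES2018 Unicode property escapes.
const isLetterOrDigit = (ch) => /^[\p{L}\p{Nd}]$/u.test(ch);

console.log(matchesAll('\u00C7a123', isLetterOrDigit));           // true  (Ça123)
console.log(indexOfMatch('ab c', (ch) => !isLetterOrDigit(ch)));  // 2     (the space)
```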

I took the liberty of pasting your source code into your answer, as this is *way* easier for me (and probably others as well) to view. I hope you don't mind. Looks nice BTW, but I think this could be more optimized. I'll let you know if I find a more efficient algorithm. — Marcel Korpel, Sep 06 '10 at 22:23
I am sure it is not the most optimal solution, but she works! — jwl, Sep 07 '10 at 12:42

score 1 · Answer 2 · answered Sep 06 '10 at 15:22

As far as I could tell when faced with a similar problem, the only way was really to pick a couple of blocks and assume those are letters. The Unicode standard has the full lists, so you could build a complete regex for this (I think). For instance, if you take all characters that are "alphabetic" according to this list you probably have all alphabetic characters. Likewise for numeric (decimal, digit, numeric) in the main Unicode data file.

I'm not entirely sure if I'm pointing in the correct direction. There's a bunch of Unicode code charts that might help, and there's of course the Unicode standard itself. It's all a bit much to read and understand though, especially if your only goal is to do some JavaScript string verification.
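The "build a regex from the lists" idea could look roughly like this. The range table is a small illustrative subset (Basic Latin and Latin-1 letters), not the full Unicode data, and the names `LETTER_RANGES`, `toEscape` and `rangesToRegExp` are hypothetical. The sketch only covers code points up to U+FFFF.

```javascript
// Illustrative subset only: a real table would be generated from the Unicode data files.
const LETTER_RANGES = [
  [0x0041, 0x005A], // A-Z
  [0x0061, 0x007A], // a-z
  [0x00C0, 0x00D6], // À-Ö
  [0x00D8, 0x00F6], // Ø-ö
  [0x00F8, 0x00FF], // ø-ÿ
];

// Format a code point (up to U+FFFF) as a \uXXXX escape for a character class.
function toEscape(codePoint) {
  return '\\u' + codePoint.toString(16).padStart(4, '0');
}

// Turn [[start, end], ...] into a RegExp matching exactly one character from any range.
function rangesToRegExp(ranges) {
  const body = ranges
    .map(([start, end]) =>
      start === end ? toEscape(start) : toEscape(start) + '-' + toEscape(end))
    .join('');
  return new RegExp('^[' + body + ']$');
}

const letterRe = rangesToRegExp(LETTER_RANGES);
console.log(letterRe.test('\u00C4')); // true  (Ä)
console.log(letterRe.test('\u00D7')); // false (×, the multiplication sign)
console.log(letterRe.test('7'));      // false
```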

wds

I looked into this direction. I figured that the easiest way to implement it might be to write a Java program that would cycle through codepoints and output a list of all the codepoint ranges for each type (letter, digit, or neither). It appears that these ranges are very small though, and I ended up with 590 separate ranges for chars between 0 and 65535. Of course, I am also not sure if the JavaScript codepoints will 100% match what is in Java depending on browser, system setup, etc... i18n is a deep pit! – jwl Sep 06 '10 at 16:37
@larson4 ah, I wish I had thought of that. The code points are the actual Unicode codepoints (encoding independent) AFAIK, so they should work fine. – wds Sep 07 '10 at 07:43
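A generated range table like the one described in these comments can be searched with a binary search, which keeps lookups cheap even with several hundred ranges. The sketch below uses a tiny illustrative subset of digit ranges, and the names `DIGIT_RANGES` and `inRanges` are hypothetical. `charCodeAt` returns a UTF-16 code unit, the same unit as a Java `char`, so values in the 0 to 65535 range line up.

```javascript
// Sorted, non-overlapping [start, end] ranges. Illustrative subset only;
// the generated table described in the comments had about 590 entries.
const DIGIT_RANGES = [
  [0x0030, 0x0039], // ASCII digits 0-9
  [0x0660, 0x0669], // Arabic-Indic digits
  [0x0966, 0x096F], // Devanagari digits
];

// Binary search: true if code falls inside any of the ranges.
function inRanges(ranges, code) {
  let lo = 0;
  let hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end] = ranges[mid];
    if (code < start) hi = mid - 1;
    else if (code > end) lo = mid + 1;
    else return true;
  }
  return false;
}

function isDigit(ch) {
  return ch.length === 1 && inRanges(DIGIT_RANGES, ch.charCodeAt(0));
}

console.log(isDigit('5'));      // true
console.log(isDigit('\u0663')); // true  (Arabic-Indic digit three)
console.log(isDigit('a'));      // false
```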
