
In JavaScript, is there a way (that survives internationalization) to determine whether a character is a letter or digit? It should correctly identify Ä and ç as letters, and recognize non-English digits (which I am not going to look up as examples)!

In Java, the Character class has some static methods .isLetter(), .isDigit(), .isLetterOrDigit(), for determining in an internationally suitable way that a char is actually a letter or digit. This is better than code like

// this is not right, but common and easy
if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) { /* it's a letter */ }

because the Character methods also pick up non-English letters. I think C# has similar capabilities...

Of course, at worst I can send strings back to the server to be checked but that's a pain...
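For comparison, here is a minimal sketch of what Java-like checks might look like in JavaScript, assuming an engine that supports Unicode property escapes in regular expressions (ES2018 and later, which postdates this question). The function names are hypothetical, chosen only to mirror the Java methods; `\p{L}` and `\p{Nd}` are the Unicode general categories for letters and decimal digits.

```javascript
// Sketch only: requires an engine with ES2018 Unicode property escapes.
// Function names are hypothetical, mirroring Java's Character methods.
const LETTER = /^\p{L}$/u;  // any Unicode letter (Lu, Ll, Lt, Lm, Lo)
const DIGIT = /^\p{Nd}$/u;  // any Unicode decimal digit

function isLetter(ch) { return LETTER.test(ch); }
function isDigit(ch) { return DIGIT.test(ch); }
function isLetterOrDigit(ch) { return isLetter(ch) || isDigit(ch); }

console.log(isLetter("\u00C4"), isLetter("\u00E7")); // Ä and ç
console.log(isDigit("\u0663"));                      // ARABIC-INDIC DIGIT THREE
console.log(isLetter("7"), isDigit("x"));
```

Note that a decomposed "Ä" (A followed by a combining diaeresis) is two code points, so a per-character check like this would need normalized input.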

In the end, I am looking to check whether input is a valid name (starts with a letter, the rest is letters or digits). An outside-the-box possibility for low-volume use might be:

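// Returns true if the browser accepts atr as an attribute name.
var validName = function (atr) {
    var ele = document.createElement("div");
    // setAttribute throws if the name contains invalid characters
    try { ele.setAttribute(atr, "xxx"); }
    catch (e) { return false; }
    return true;
};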

This tests out fairly well in IE, FF and Chrome, though thorough testing would be needed to figure out how consistent the answers are across browsers. And again, it is not appropriate for heavy-duty usage because it creates an element on every call.
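The rule as stated (a letter first, then letters or digits) can also be sketched directly as a single regular expression, assuming ES2018 Unicode property escapes are available. `isValidSimpleName` is a hypothetical helper, not a DOM or library function. It is stricter than the `setAttribute` trick, since attribute names can generally also contain characters such as `-` and `_`.

```javascript
// Sketch only: assumes ES2018 Unicode property escapes are available.
// isValidSimpleName is a hypothetical helper, not a DOM or library API.
// Pattern: one Unicode letter, then any number of letters or decimal digits.
const SIMPLE_NAME = /^\p{L}[\p{L}\p{Nd}]*$/u;

function isValidSimpleName(name) {
  return typeof name === "string" && SIMPLE_NAME.test(name);
}

console.log(isValidSimpleName("\u00C4nderung2")); // starts with Ä
console.log(isValidSimpleName("2fast"));          // starts with a digit
```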

jwl
  • Sadly, the JavaScript regex \w (to match a word character) treats Ä as a non-word character, in Chrome and FF at least. – jwl Sep 03 '10 at 20:41
  • This related question http://stackoverflow.com/questions/1073412/javascript-validation-issue-with-international-characters seems to indicate there is not really a true solution other than trying to list the characters you are going to pretend are not digits and letters... I hope someone knows better! – jwl Sep 03 '10 at 20:46
  • What is "non English digits" supposed to include? – NullUserException Sep 06 '10 at 21:24

2 Answers


I have created a small JavaScript utility to provide this functionality. I don't claim it is perfect, so let me know how you fare. If people like it, I'll make this the official answer to this question.

CharFunk: https://github.com/joelarson4/CharFunk

  • CharFunk.getDirectionality(ch) - Used to find the directionality of the character
  • CharFunk.isAllLettersOrDigits(string) - Returns true if the string argument is composed of all letters and digits
  • CharFunk.isDigit(ch) - Returns true if provided a length 1 string that is a digit
  • CharFunk.isLetter(ch) - Returns true if provided a length 1 string that is a letter
  • CharFunk.isLetterNumber(ch) - Returns true if provided a length 1 string that is in the Unicode "Nl" category
  • CharFunk.isLetterOrDigit(ch) - Returns true if provided a length 1 string that is a letter or a digit
  • CharFunk.isLowerCase(ch) - Returns true if provided a length 1 string that is lowercase
  • CharFunk.isMirrored(ch) - Returns true if provided a length 1 string that is a mirrored character
  • CharFunk.isUpperCase(ch) - Returns true if provided a length 1 string that is uppercase
  • CharFunk.isValidFirstForName(ch) - Returns true if provided a length 1 string that is a valid leading character for a JavaScript identifier
  • CharFunk.isValidMidForName(ch) - Returns true if provided a length 1 string that is a valid non-leading character for an ECMAScript identifier
  • CharFunk.isValidName(string,checkReserved) - Returns true if the string is a valid ECMAScript identifier
  • CharFunk.isWhitespace(ch) - Returns true if provided a length 1 string that is a whitespace character
  • CharFunk.indexOf(string,callback) - Returns the index of the first character for which the callback returns true
  • CharFunk.lastIndexOf(string,callback) - Returns the index of the last character for which the callback returns true
  • CharFunk.matchesAll(string,callback) - Returns true if all characters in the provided string result in a true return from the callback
  • CharFunk.replaceMatches(string,callback,ch) - Returns a new string with all matched characters replaced
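
The callback-style functions at the end of the list can be pictured roughly as follows. This is a sketch of the general idea only, not CharFunk's actual source: the names `indexOfMatch` and `matchesAllChars` are hypothetical, the `-1` return for "no match" is an assumption borrowed from `String.prototype.indexOf`, and the example callback assumes ES2018 Unicode property escapes.

```javascript
// Hypothetical stand-ins, NOT CharFunk's implementation.
// Each walks the string one UTF-16 unit at a time and asks the callback
// whether that length-1 string matches.
function indexOfMatch(string, callback) {
  for (let i = 0; i < string.length; i++) {
    if (callback(string.charAt(i))) return i;
  }
  return -1; // assumption: no match is reported as -1
}

function matchesAllChars(string, callback) {
  for (let i = 0; i < string.length; i++) {
    if (!callback(string.charAt(i))) return false;
  }
  return true; // vacuously true for the empty string
}

// Example callback; assumes ES2018 Unicode property escapes.
const isLetter = (ch) => /^\p{L}$/u.test(ch);

console.log(indexOfMatch("12\u00C4b", isLetter));        // index of the first letter
console.log(matchesAllChars("\u00C4\u00E7z", isLetter)); // are all characters letters?
```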
Benjamin Podszun
jwl
  • I took the liberty to paste your source code into your answer, as this is *way* easier for me (and probably others as well) to view it. I hope you don't mind. Looks nice BTW, but I think this could be more optimized. I'll let you know if I found a more efficient algorithm. – Marcel Korpel Sep 06 '10 at 22:23
  • I am sure it is not the most optimal solution, but she works! – jwl Sep 07 '10 at 12:42

As far as I could tell when faced with a similar problem, the only way was really to pick a couple of blocks and assume those are letters. The Unicode standard has the full lists, so you could build a complete regex for this (I think). For instance, if you take all characters that are "alphabetic" according to this list, you probably have all alphabetic characters. Likewise for numeric (decimal, digit, numeric) in the main Unicode data file.
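
A minimal sketch of the "pick some blocks and build a regex" idea might look like this. The range table is a small illustrative subset chosen by hand, not a complete list; a real table would be generated from the Unicode data files. The names `LETTER_RANGES`, `toEscape` and `buildCharClass` are hypothetical.

```javascript
// Illustrative subset of letter ranges as [first, last] code points.
// A real table would be generated from the Unicode data files.
const LETTER_RANGES = [
  [0x0041, 0x005a], // A-Z
  [0x0061, 0x007a], // a-z
  [0x00c0, 0x00d6], // À-Ö
  [0x00d8, 0x00f6], // Ø-ö (skips the multiplication sign U+00D7)
  [0x00f8, 0x00ff], // ø-ÿ (skips the division sign U+00F7)
  [0x0391, 0x03a1], // Greek Α-Ρ
  [0x03a3, 0x03c9], // Greek Σ-ω
  [0x0410, 0x044f], // Cyrillic А-я
];

// Format a BMP code point as a \uXXXX escape for use in a character class.
function toEscape(cp) {
  return "\\u" + cp.toString(16).padStart(4, "0");
}

// Build a regex matching exactly one character from any listed range.
function buildCharClass(ranges) {
  const body = ranges
    .map(([lo, hi]) => toEscape(lo) + "-" + toEscape(hi))
    .join("");
  return new RegExp("^[" + body + "]$");
}

const isListedLetter = buildCharClass(LETTER_RANGES);
console.log(isListedLetter.test("\u00C4")); // Ä
console.log(isListedLetter.test("\u00D7")); // multiplication sign
```

Because the pattern uses only `\uXXXX` escapes, it does not need any Unicode-aware regex features, at the cost of maintaining the table yourself.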

I'm not entirely sure if I'm pointing in the correct direction. There's a bunch of Unicode code charts that might help, and there's of course the Unicode standard itself. It's all a bit much to read and understand though, especially if your only goal is to do some JavaScript string verification.

wds
  • I looked into this direction. I figured that the easiest way to implement it might be to write a Java program that would cycle through codepoints and output a list of all the codepoint ranges for each type (letter, digit, or neither). It appears that these ranges are very small though, and I ended up with 590 separate ranges for chars between 0 and 65535. Of course, I am also not sure if the JavaScript codepoints will 100% match what is in Java depending on browser, system setup, etc... i18n is a deep pit! – jwl Sep 06 '10 at 16:37
  • @larson4 ah I wish I thought of that. The code points are the actual unicode codepoints (encoding independent) AFAIK, so they should work fine. – wds Sep 07 '10 at 07:43
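
The range-generation approach described in the comments can be sketched in JavaScript rather than Java, assuming an engine with ES2018 Unicode property escapes. `collectRanges` is a hypothetical helper; the number of ranges it finds depends on the Unicode version the engine implements, so it need not match the count quoted above.

```javascript
// Sketch: collapse all code points up to maxCodePoint that pass a test
// into [first, last] ranges. Results depend on the engine's Unicode version.
function collectRanges(test, maxCodePoint = 0xffff) {
  const ranges = [];
  let start = -1; // first code point of the run in progress, or -1 if none
  for (let cp = 0; cp <= maxCodePoint; cp++) {
    if (test(String.fromCodePoint(cp))) {
      if (start < 0) start = cp; // a new run begins here
    } else if (start >= 0) {
      ranges.push([start, cp - 1]); // the run ended at the previous code point
      start = -1;
    }
  }
  if (start >= 0) ranges.push([start, maxCodePoint]); // close a run that reaches the end
  return ranges;
}

// Assumes ES2018 Unicode property escapes for the letter test.
const letterRanges = collectRanges((ch) => /^\p{L}$/u.test(ch));
console.log(letterRanges.length, letterRanges.slice(0, 3));
```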