34

I want to match a string to make sure it contains only letters.

I've got this and it works just fine:

var onlyLetters = /^[a-zA-Z]*$/.test(myString);

BUT

Since I speak another language too, I need to allow all letters, not just A-Z. Also for example:

é ü ö ê å ø

Does anyone know if there is a global 'alpha' term that includes all letters to use with RegExp? Or even better, does anyone have some kind of solution?

Thanks a lot

EDIT: Just realized that you might also want to allow '-' and ' ' in case of a double name like 'Mary-Ann' or 'Mary Ann'.

patad
  • 3
    The [a-zA-Z] thing works because the letters and numbers are consecutive ASCII codes, so unless there's a built in function in your language's implementation of Regex or the special characters are consecutive in your string encoding, chances are you'll have to just write them all out. – Ed James Jan 06 '10 at 14:15
  • Maybe I should do the opposite: check that the string does NOT contain any digits or special characters like * - . Probably won't work anyway, since ø is probably counted among the special characters. Darn. – patad Jan 06 '10 at 14:21
  • What characters count as letters? Examples: $, €, æ, ʩ – GvS Jan 06 '10 at 14:23
  • @Isabell: That is the answer I posted. It is known as a blacklist (I do not know your knowledge level). – Hazior Jan 06 '10 at 14:36
  • I like your question - gives us our answers without fuss. – Lucas Jul 13 '12 at 00:50

12 Answers

35

I don’t know the actual reason for doing this, but if you want to use it as a pre-check for, say, login names or user nicknames, I’d suggest you enter the characters yourself and don’t use the whole set of ‘alpha’ characters you’ll find in Unicode, because you probably won’t see a visual difference between the following letters:

А ≠ A ≠ Α  # cyrillic, latin, greek

In such cases it’s better to specify the allowed letters manually if you want to minimise account faking and such.

Addition

Well, if it’s for a field which is supposed to be non-unique, I would allow Greek as well. I wouldn’t feel good about forcing users to change their name to a latinised version.

But for unique fields like nicknames you need to give your other visitors of the site a hint, that it’s really the nickname they think it is. Bad enough that people will fake accounts with interchanging I and l already. Of course, it’s something that depends on your users; but to be sure I think it’s better to allow basic latin + diacritics only. (Maybe have a look at this list: Latin-derived_alphabet)

As an untested suggestion (with ‘-’, ‘_’ and ‘ ’):

/^[a-zA-Z\-_ ’'‘ÆÐƎƏƐƔIJŊŒẞÞǷȜæðǝəɛɣijŋœĸſßþƿȝĄƁÇĐƊĘĦĮƘŁØƠŞȘŢȚŦŲƯY̨Ƴąɓçđɗęħįƙłøơşșţțŧųưy̨ƴÁÀÂÄǍĂĀÃÅǺĄÆǼǢƁĆĊĈČÇĎḌĐƊÐÉÈĖÊËĚĔĒĘẸƎƏƐĠĜǦĞĢƔáàâäǎăāãåǻąæǽǣɓćċĉčçďḍđɗðéèėêëěĕēęẹǝəɛġĝǧğģɣĤḤĦIÍÌİÎÏǏĬĪĨĮỊIJĴĶƘĹĻŁĽĿʼNŃN̈ŇÑŅŊÓÒÔÖǑŎŌÕŐỌØǾƠŒĥḥħıíìiîïǐĭīĩįịijĵķƙĸĺļłľŀʼnńn̈ňñņŋóòôöǒŏōõőọøǿơœŔŘŖŚŜŠŞȘṢẞŤŢṬŦÞÚÙÛÜǓŬŪŨŰŮŲỤƯẂẀŴẄǷÝỲŶŸȲỸƳŹŻŽẒŕřŗſśŝšşșṣßťţṭŧþúùûüǔŭūũűůųụưẃẁŵẅƿýỳŷÿȳỹƴźżžẓ]+$/.test(myString)

Another edit: I have added the apostrophe for people with names like O’Neill or O’Reilly. (And the straight and the reversed apostrophe for people who can’t enter the curly one correctly.)

Debilski
  • 1
    good point. it's for a form and the Name input. come to think about it, I have seen loads of "choose a username (A-Z 0-9 - .)" then if ur greek, I guess ur just unlucky :-p – patad Jan 06 '10 at 14:34
  • wow look at that! looks like u managed to catch all weird characters ever made :-p and it works great! awesome job! thanks for that! – patad Jan 06 '10 at 15:37
  • 2
    I'm positive that regex can be improved somewhat by using character ranges. Something like: `[A-Za-zÀ-ÿ]` would catch the ASCII letters plus the accented Latin-1 letters. Check http://en.wikipedia.org/wiki/List_of_Unicode_characters for a full list. – DisgruntledGoat Jan 08 '10 at 12:28
  • But between ‘À’ and ‘ÿ’ there is ‘×’ and ‘÷’ which you might want to exclude. Nonetheless, if ranges work also for unicode characters, one could just include the ranges of Latin Extended-A and Extended-B and the Basic Latin stuff. – Debilski Jan 08 '10 at 14:47
  • @Debilski, you're totally right, ‘×’ and ‘÷’ are not accepted. This is the one I chose: /^[a-zA-Z\- ÅåÄäÖöØøÆæÉéÈèÜüÊêÛûÎî]*$/ – patad Jan 08 '10 at 21:26
  • @Debilski This was added about a year after you answered: http://stackoverflow.com/a/18391901/759452 What about using this piece of code to remove accents/diacritics? – Adriano Sep 16 '14 at 14:44
18
var onlyLetters = /^[a-zA-Z\u00C0-\u00ff]+$/.test(myString)
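The range \u00C0-\u00ff covers the accented letters of the Latin-1 Supplement block, but it also contains the two math signs × (U+00D7) and ÷ (U+00F7). A minimal sketch that splits the range to skip them; the names LATIN1_LETTERS and isLatin1Letters are made up for illustration:

```javascript
// Latin-1 letters only: the three sub-ranges leave out U+00D7 (×) and U+00F7 (÷).
var LATIN1_LETTERS = /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/;

// Hypothetical helper name, not part of the answer above.
function isLatin1Letters(myString) {
    return LATIN1_LETTERS.test(myString);
}

console.log(isLatin1Letters("Zo\u00EB"));  // "Zoë": letters only
console.log(isLatin1Letters("3\u00D74"));  // "3×4": digits and a multiplication sign
```

Letters outside Latin-1, such as Ł (U+0141), are still rejected by this pattern.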
Spudley
Corey
11

You can't do this in JS. It has very limited regex and normalizer support. You would need to construct a lengthy and unmaintainable character array with all possible Latin characters with diacritical marks (I guess there are around 500 different ones). Rather delegate the validation task to the server side, which uses another language with more regex capabilities, if necessary with the help of Ajax.

In a full-fledged regex environment you could just test if the string matches \p{L}+. Here's a Java example:

boolean valid = string.matches("\\p{L}+");

Alternatively, you could also normalize the text to get rid of the diacritical marks and check whether it contains only [A-Za-z]+. Here's another Java example:

// Requires: import java.text.Normalizer; import java.text.Normalizer.Form;
string = Normalizer.normalize(string, Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
boolean valid = string.matches("[A-Za-z]+");

PHP supports similar functions.
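For readers on newer JavaScript engines: ES2018 added Unicode property escapes, so the \p{L} test can be written in the browser as well, provided the regex carries the u flag. A minimal sketch, assuming such an engine; the function name isAllLetters is made up for illustration:

```javascript
// Assumes an ES2018+ engine: \p{L} is only recognised when the `u` flag is set.
// isAllLetters is a hypothetical helper name.
function isAllLetters(myString) {
    // Compose first, so that "e" + combining acute (U+0301) becomes a single letter.
    var composed = myString.normalize("NFC");
    return /^\p{L}+$/u.test(composed);
}

console.log(isAllLetters("\u00E9\u00FC\u00F6\u00EA\u00E5\u00F8")); // é ü ö ê å ø
console.log(isAllLetters("abc1"));
```

Note that \p{L} accepts letters from every script (Greek, Cyrillic and so on), which may be more than a username field wants. Server-side validation is still needed either way.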

BalusC
  • this solution seems pretty good though http://stackoverflow.com/a/18391901/759452 , what's your opinion? – Adriano Sep 16 '14 at 14:30
  • Regarding your other point "In a full fledged regex environment ..." this polyfill may do the job https://github.com/slevithan/xregexp , note that I am not discussing the fact that validation should definitely be happening on server side too (I'd use JS validation just as a "luxury" feature to lower the number of calls to the server). – Adriano Sep 16 '14 at 14:37
9

When I tried to implement @Debilski's solution JavaScript didn't like the extended Latin characters -- I had to code them as JavaScript escapes:

// The huge unicode escape string is equal to ÆÐƎƏƐƔIJŊŒẞÞǷȜæðǝəɛɣijŋœĸſßþƿȝĄƁÇĐƊĘĦ
// ĮƘŁØƠŞȘŢȚŦŲƯY̨Ƴąɓçđɗęħįƙłøơşșţțŧųưy̨ƴÁÀÂÄǍĂĀÃÅǺĄÆǼǢƁĆĊĈČÇĎḌĐƊÐÉÈĖÊËĚĔĒĘẸƎ
// ƏƐĠĜǦĞĢƔáàâäǎăāãåǻąæǽǣɓćċĉčçďḍđɗðéèėêëěĕēęẹǝəɛġĝǧğģɣĤḤĦIÍÌİÎÏǏĬĪĨĮỊ
// IJĴĶƘĹĻŁĽĿʼNŃN̈ŇÑŅŊÓÒÔÖǑŎŌÕŐỌØǾƠŒĥḥħıíìiîïǐĭīĩįịijĵķƙĸĺļłľŀʼnńn̈ňñ
// ņŋóòôöǒŏōõőọøǿơœŔŘŖŚŜŠŞȘṢẞŤŢṬŦÞÚÙÛÜǓŬŪŨŰŮŲỤƯẂẀŴẄǷÝỲŶŸȲỸƳŹŻŽẒŕřŗſśŝšşșṣßťţṭ
// ŧþúùûüǔŭūũűůųụưẃẁŵẅƿýỳŷÿȳỹƴźżžẓ

function isAlpha(string) {
    var patt = /^[a-zA-Z\u00C6\u00D0\u018E\u018F\u0190\u0194\u0132\u014A\u0152\u1E9E\u00DE\u01F7\u021C\u00E6\u00F0\u01DD\u0259\u025B\u0263\u0133\u014B\u0153\u0138\u017F\u00DF\u00FE\u01BF\u021D\u0104\u0181\u00C7\u0110\u018A\u0118\u0126\u012E\u0198\u0141\u00D8\u01A0\u015E\u0218\u0162\u021A\u0166\u0172\u01AFY\u0328\u01B3\u0105\u0253\u00E7\u0111\u0257\u0119\u0127\u012F\u0199\u0142\u00F8\u01A1\u015F\u0219\u0163\u021B\u0167\u0173\u01B0y\u0328\u01B4\u00C1\u00C0\u00C2\u00C4\u01CD\u0102\u0100\u00C3\u00C5\u01FA\u0104\u00C6\u01FC\u01E2\u0181\u0106\u010A\u0108\u010C\u00C7\u010E\u1E0C\u0110\u018A\u00D0\u00C9\u00C8\u0116\u00CA\u00CB\u011A\u0114\u0112\u0118\u1EB8\u018E\u018F\u0190\u0120\u011C\u01E6\u011E\u0122\u0194\u00E1\u00E0\u00E2\u00E4\u01CE\u0103\u0101\u00E3\u00E5\u01FB\u0105\u00E6\u01FD\u01E3\u0253\u0107\u010B\u0109\u010D\u00E7\u010F\u1E0D\u0111\u0257\u00F0\u00E9\u00E8\u0117\u00EA\u00EB\u011B\u0115\u0113\u0119\u1EB9\u01DD\u0259\u025B\u0121\u011D\u01E7\u011F\u0123\u0263\u0124\u1E24\u0126I\u00CD\u00CC\u0130\u00CE\u00CF\u01CF\u012C\u012A\u0128\u012E\u1ECA\u0132\u0134\u0136\u0198\u0139\u013B\u0141\u013D\u013F\u02BCN\u0143N\u0308\u0147\u00D1\u0145\u014A\u00D3\u00D2\u00D4\u00D6\u01D1\u014E\u014C\u00D5\u0150\u1ECC\u00D8\u01FE\u01A0\u0152\u0125\u1E25\u0127\u0131\u00ED\u00ECi\u00EE\u00EF\u01D0\u012D\u012B\u0129\u012F\u1ECB\u0133\u0135\u0137\u0199\u0138\u013A\u013C\u0142\u013E\u0140\u0149\u0144n\u0308\u0148\u00F1\u0146\u014B\u00F3\u00F2\u00F4\u00F6\u01D2\u014F\u014D\u00F5\u0151\u1ECD\u00F8\u01FF\u01A1\u0153\u0154\u0158\u0156\u015A\u015C\u0160\u015E\u0218\u1E62\u1E9E\u0164\u0162\u1E6C\u0166\u00DE\u00DA\u00D9\u00DB\u00DC\u01D3\u016C\u016A\u0168\u0170\u016E\u0172\u1EE4\u01AF\u1E82\u1E80\u0174\u1E84\u01F7\u00DD\u1EF2\u0176\u0178\u0232\u1EF8\u01B3\u0179\u017B\u017D\u1E92\u0155\u0159\u0157\u017F\u015B\u015D\u0161\u015F\u0219\u1E63\u00DF\u0165\u0163\u1E6D\u0167\u00FE\u00FA\u00F9\u00FB\u00FC\u01D4\u016D\u016B\u0169\u0171\u016F\u0173\u1EE5\u01B0\u1E83\u1E81\u0175\u1E85\u01BF\u00FD\u1EF3\u0177\u00FF\u0233\u1EF9\u01B4\u017A\u017C\u017E\u1E93]+$/;
    return patt.test(string);
}
Ben Y
7

There should be, but the regex will be localization dependent. Thus, é ü ö ê å ø won't be filtered if you're on a US localization, for example. To ensure your web site does what you want across all localizations, you should explicitly write out the characters in a form similar to what you are already doing.

The only standard one I am aware of though is \w, which would match all alphanumeric characters (and the underscore). You could do it the "standard" way by running two regexes, one to verify that \w matches and another to verify that \d (all digits) does not match, which would leave only letters and the underscore. Again, I'd strongly urge you not to use this technique as there's no guarantee what \w will represent in a given localization, but this does answer your question.
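For what it's worth, in JavaScript specifically \w is defined as [A-Za-z0-9_] regardless of locale, so the two-regex idea can be tried directly. A small sketch with made-up function names that shows where it falls short:

```javascript
// Hypothetical helper names, used only to illustrate the two-step check.
function isWordChars(s) {
    return /^\w+$/.test(s);   // in JavaScript: [A-Za-z0-9_] only
}

function hasDigit(s) {
    return /\d/.test(s);
}

function twoStepAlphaCheck(s) {
    return isWordChars(s) && !hasDigit(s);
}

console.log(twoStepAlphaCheck("abc"));     // true
console.log(twoStepAlphaCheck("a_b"));     // true: the underscore slips through
console.log(twoStepAlphaCheck("\u00E9"));  // false: é is not matched by \w
```

So in JavaScript this approach neither accepts accented letters nor excludes the underscore.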

David Pfeffer
7

This can be tricky; unfortunately, JavaScript has pretty poor support for internationalization. To do this check you'll have to create your own character class. This is because, for instance, \w is the same as [0-9A-Z_a-z], which won't help you much, and there isn't anything like [[:alpha:]] in JavaScript. But since it sounds like you're only going to use one other language, you can probably just add those other characters to your character class.

By the way, I think you'll need a ? or * in your regexp there if myString can be longer than one character.

The full example,

/^[a-zA-Zéüöêåø]*$/.test(myString);
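To also cover the double names from the question's edit ('Mary-Ann', 'Mary Ann'), the same hand-written class can be extended with a space and an escaped hyphen. A minimal sketch; NAME_PATTERN and isValidName are made-up names, and + is used instead of * on the assumption that an empty name should be rejected:

```javascript
// Hand-picked letters plus space and hyphen; extend the class as needed.
var NAME_PATTERN = /^[a-zA-Zéüöêåø \-]+$/;

// Hypothetical helper name.
function isValidName(myString) {
    return NAME_PATTERN.test(myString);
}

console.log(isValidName("Mary-Ann"));  // true
console.log(isValidName("Mary Ann"));  // true
console.log(isValidName("Mary_Ann"));  // false: underscore is not in the class
```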

aDev
6

I don't know anything about Javascript, but if it has proper unicode support, convert your string to a decomposed form, then remove the diacritics from it ([\u0300-\u036f\u1dc0-\u1dff]). Then your letters will only be ASCII ones.
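In JavaScript engines that implement ES2015, String.prototype.normalize can produce the decomposed form, so the idea can be sketched as follows. The function names are made up for illustration:

```javascript
// Assumes an ES2015+ engine that provides String.prototype.normalize.
// stripDiacritics and isAsciiLettersAfterStripping are hypothetical names.
function stripDiacritics(myString) {
    // NFD splits "é" into "e" + U+0301; the replace then drops the combining marks.
    return myString.normalize("NFD").replace(/[\u0300-\u036f\u1dc0-\u1dff]/g, "");
}

function isAsciiLettersAfterStripping(myString) {
    return /^[a-zA-Z]+$/.test(stripDiacritics(myString));
}

console.log(stripDiacritics("\u00E9\u00FC\u00F6\u00EA\u00E5"));  // "euoea"
console.log(isAsciiLettersAfterStripping("\u00F8"));              // false
```

Letters such as ø (U+00F8) have no canonical decomposition, so they survive the stripping step and would have to be listed separately.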

Virgil Dupras
  • This won't work because some of his letters are not just diacritical ASCII. `ø` for example was mentioned, and this isn't the diacritic of `o` as far as I know. – David Pfeffer Jan 06 '10 at 14:20
  • 1
    Hum, yeah. But if he's going to enumerate all valid characters, doing this diacritic tricks is going to save him quite a few enumerations, even if he has to specify `ø` separately. – Virgil Dupras Jan 06 '10 at 14:30
6

You could always use a blacklist instead of a whitelist. That way you only remove the characters you do not need.

Hazior
  • never heard of it but it sort of speaks for itself. don't u just check whether it does not contain this that etc? – patad Jan 06 '10 at 14:42
  • A blacklist excludes what you do not need. A whitelist only allows what you need. Blacklists are used when you only want to ban certain characters like / or <. – Hazior Jan 06 '10 at 14:45
  • so do you declare a blacklist in a special way or is it just a regular regexp saying "does not contain" instead of does? – patad Jan 06 '10 at 14:50
  • http://www.hendricom.com/forums/index.php?showtopic=2282 ^ is the blacklist symbol though. – Hazior Jan 06 '10 at 15:07
  • 2
    That blacklist would need to be pretty long to be sensible. – Debilski Jan 06 '10 at 15:08
  • if the blacklist symbol is ^ how come /^[a-zA-Zéüöêåø]*$/.test(myString) returns false when myString contains digits? shouldn't it be the other way around then? uhh nvm :-p – patad Jan 06 '10 at 15:10
  • You only need to blacklist the symbols that you don't want them typing. It doesn't have to be long. But whitelisting is the best coding practice in my opinion. – Hazior Jan 06 '10 at 15:15
  • @Isabell: Since you said nvm, is it safe to assume you figured it out? – Hazior Jan 06 '10 at 15:15
  • If the character set is UTF-16 the blacklist would need to be about 65k long! – Pool Jan 06 '10 at 18:51
  • @The Feast: So say you just want to blacklist "'" It would be 65k long? Maybe to their optimal solution it may be large but you could also do a combination of whitelisting/blacklisting. – Hazior Jan 06 '10 at 18:56
4

You could use a blacklist - a list of characters to exclude.

Also, it is important to verify the input on server-side, not only on client-side! Client-side can be bypassed easily.
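A blacklist can be written either as a plain character class that must not match, or as a negated class ([^...]) that every character must match. A minimal sketch; the forbidden characters are an arbitrary example and the names are made up:

```javascript
// Hypothetical blacklist: digits and a handful of punctuation characters.
var FORBIDDEN = /[0-9<>\/\\*."']/;

// The same list as a negated class: inside [...] a leading ^ means "anything but".
var ONLY_ALLOWED = /^[^0-9<>\/\\*."']*$/;

function passesBlacklist(myString) {
    return !FORBIDDEN.test(myString);
}

console.log(passesBlacklist("Mary-Ann"));   // true
console.log(passesBlacklist("R2D2"));       // false
console.log(ONLY_ALLOWED.test("Mary-Ann")); // true
```

Outside a character class, ^ only anchors the match to the start of the string, which is why /^[a-zA-Z]*$/ behaves as a whitelist.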

Frunsi
3
var regexp = /\B\#[a-zA-Z\x7f-\xff]+/g; 
var result = searchText.match(regexp);
  • 1
    While this code snippet may solve the question, [including an explanation](//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. Please also try not to crowd your code with explanatory comments, this reduces the readability of both the code and the explanations! – kayess Jul 13 '17 at 13:15
2

I'm using a converter before checking, but it's still not friendly for all languages. I'm not sure that's possible.

function noExtendedChars( input_name ){

    var whitelist = [
        ['a',  'à','á','â','ä','æ','ã','å','ā'],
        ['c',  'ç', 'ć', 'č'],
        ['e',  'è','é','ê','ë','ē','ė','ę'],
        ['i',  'ï','ï','í','ī','į','î'],
        ['l',  'ł'],
        ['n',  'ñ', 'ń'],
        ['o',  'ô', 'ö', 'ò', 'ó', 'œ', 'ø', 'ō', 'õ' ],
        ['s',  'ß', 'ś', 'š' ],
        ['u',  'û', 'ü', 'ù', 'ú', 'ū'],
        ['y',  'ÿ'],
        ['z',  'ž', 'ź', 'ż']
        ];

    // Replace every accented variant (r[1..]) with its base letter (r[0]).
    for (var b = 0; b < whitelist.length; b++) {
        var r = whitelist[b];
        for (var a = 1; a < r.length; a++) {
            input_name = input_name.replace(new RegExp(r[a], "gi"), r[0]);
        }
    }
    return input_name;

}
Joeri
2

There are some shortcuts to achieve this in other regular expression dialects - see this page. But I don't believe there are any standardised ones in JavaScript - certainly none that would be supported by all browsers.

David M