regular expressions with the Cyrillic alphabet?

Question

I am writing validation for user-entered data in C#, using regular expressions.

Password = @"(?!^[0-9]*$)(?!^[a-zA-Z]*$)^([a-zA-Z0-9]{6,18})$"

Validate Alpha Numeric = [^a-zA-Z0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]

The above work fine for the Latin alphabet, but how can I extend them to work with the Cyrillic alphabet?

I don't know much about regular expressions. How would I modify the above to include this? – amateur, Feb 16 '13 at 02:21

Sergey Kalinichenko · Accepted Answer · 2013-02-16T02:31:24.027

11

The basic approach to covering ranges of characters using regular expressions is to construct an expression of the form [A-Za-z], where A is the first letter of the range, and Z is the last letter of the range.

The problem is that there is no single Cyrillic alphabet: the set of letters differs from language to language. To cover the Russian version of Cyrillic, use [А-Яа-яЁё]; the letters Ё and ё must be listed explicitly because they fall outside the А-Я and а-я ranges in Unicode. You would use a different set for, say, Serbian, because the last letter in its Cyrillic alphabet is Ш, not Я, and it includes letters such as Ђ and Љ that Russian does not have.

Another approach is to list all characters one-by-one. Simply find an authoritative reference for the alphabet that you want to put in a regexp, and put all characters for it into a pair of square brackets:

[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя]
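
As a rough illustration of the difference between the plain range and the full enumeration, here is a minimal C# sketch. The class and method names (`RussianLetters`, `MatchesRangeOnly`, `MatchesWithYo`) are hypothetical and exist only for this example; the only assumption is that the input should consist entirely of Russian letters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class RussianLetters
{
    // А-Я and а-я cover U+0410..U+044F only.
    private static readonly Regex RangeOnly = new Regex(@"^[А-Яа-я]+$");

    // Ё (U+0401) and ё (U+0451) lie outside that range, so they are added explicitly.
    private static readonly Regex WithYo = new Regex(@"^[А-Яа-яЁё]+$");

    public static bool MatchesRangeOnly(string s) => RangeOnly.IsMatch(s);

    public static bool MatchesWithYo(string s) => WithYo.IsMatch(s);
}
```

With these definitions, `RussianLetters.MatchesRangeOnly("ёлка")` returns false because of the leading ё, while `RussianLetters.MatchesWithYo("ёлка")` returns true.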

edited Feb 16 '13 at 02:31

answered Feb 16 '13 at 02:25

Sergey Kalinichenko

+1. Good point on "no Cyrillic alphabet": there are Cyrillic characters (@"\p{IsCyrillic}+"), but if one needs to limit input to a given language, explicit enumeration is the way to go. – Alexei Levenkov Feb 16 '13 at 02:44
Thanks for this - how would I add this to the regular expressions that I provided above? – amateur Feb 16 '13 at 17:03
@amateur Just like this - `[^a-zA-ZА-Яа-я0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]` – Sergey Kalinichenko Feb 16 '13 at 17:21
@dasblinkenlight The problem here is that you allow a set of Latin and Cyrillic letters but still don't support Greek, Hebrew, Arabic, Japanese, Chinese, Korean, etc. So I'd prefer Alexei Levenkov's solution if you don't need only specific characters and expect your code to be used worldwide. – ecth Aug 11 '16 at 08:25

score 9 · Answer 2 · answered Feb 16 '13 at 02:37

You can use Unicode character classes if you need to allow characters of a particular language or a particular type:

@"\p{IsCyrillic}+" // characters in the Cyrillic block (U+0400-U+04FF)
@"[\p{Ll}\p{Lt}]+" // lowercase and titlecase letters in any language; add \p{Lu} to include uppercase

In your case, "not whitespace" may be enough: @"[^\s]+". Or perhaps "word character" (which includes digits and underscores): @"\w+".
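
To make the character-class approach concrete, here is a small C# sketch. The class and method names are hypothetical, and the choice of `\p{L}` (any letter) and `\p{Nd}` (decimal digit) is an assumption about what "alphanumeric in any language" should mean for the question; adjust the categories to your own rules.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class UnicodeClasses
{
    // In .NET, \p{IsCyrillic} is the named block U+0400..U+04FF.
    public static bool IsCyrillicBlockOnly(string s) =>
        Regex.IsMatch(s, @"^\p{IsCyrillic}+$");

    // \p{L} matches any letter (upper, lower, title, modifier, other).
    public static bool IsLettersOnly(string s) =>
        Regex.IsMatch(s, @"^\p{L}+$");

    // Letters or decimal digits from any script, nothing else.
    public static bool IsAlphanumeric(string s) =>
        Regex.IsMatch(s, @"^[\p{L}\p{Nd}]+$");
}
```

For example, `UnicodeClasses.IsCyrillicBlockOnly("Привет")` is true and `UnicodeClasses.IsCyrillicBlockOnly("Privet")` is false, while `UnicodeClasses.IsLettersOnly` accepts both. `UnicodeClasses.IsAlphanumeric("абв 123")` is false because of the space.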

answered Feb 16 '13 at 02:37

Alexei Levenkov

+1 It's nice to know that there are convenient character classes for detecting various native alphabets. – Sergey Kalinichenko Feb 16 '13 at 02:54
`[\p{Ll}\p{Lt}]` I think some characters might be missing, but I don't know the exact difference between "title case" and "upper case"... http://msdn.microsoft.com/en-us/library/20bw873z.aspx#SupportedUnicodeGeneralCategories – nhahtdh Feb 16 '13 at 07:16

Just a random note: `\p{IsCyrillic}` means the [Cyrillic **block**](http://msdn.microsoft.com/en-us/library/20bw873z.aspx#SupportedNamedBlocks) in C#, but it means the [Cyrillic **script**](http://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode) (containing many blocks) in Java. – nhahtdh Feb 16 '13 at 07:25

score 1 · Answer 3 · answered Feb 16 '13 at 02:22

Password = @"(?!^[0-9]*$)(?!^[А-Яа-я]*$)^([А-Яа-я0-9]{6,18})$"

Validate Alpha Numeric = [^а-яА-Я0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]
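
A minimal sketch of how the password pattern above might be used in C#. The `CyrillicPassword` class and `IsValid` method are hypothetical names for this example. The pattern is taken as written, so it accepts only Russian range letters and digits (no Latin letters, no Ё or ё), requires 6 to 18 characters, and rejects input that is all digits or all letters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from this answer.
public static class CyrillicPassword
{
    private static readonly Regex Pattern = new Regex(
        @"(?!^[0-9]*$)(?!^[А-Яа-я]*$)^([А-Яа-я0-9]{6,18})$");

    // True when the input is 6-18 characters of Cyrillic letters and digits
    // and contains at least one of each.
    public static bool IsValid(string password) => Pattern.IsMatch(password);
}
```

For example, `CyrillicPassword.IsValid("пароль123")` is true, while `"123456"` (digits only), `"пароль"` (letters only), `"абв12"` (too short) and `"password1"` (Latin letters) are all rejected.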

answered Feb 16 '13 at 02:22

KJW
