I have a working regex that matches ASCII alphanumeric characters:

 string pattern = "^[a-zA-Z0-9]+$";
 Match match = Regex.Match(input, pattern);
 if (match.Success)
 {
   ...

I want to extend this to apply the same concept, but include all latin characters (e.g. å, Ø etc).

I've read about unicode scripts. And I've tried this:

 string pattern = "^[{Latin}0-9]+$";

But it's not matching the patterns I expect. How do I match latin unicode using unicode scripts or an alternative method?

mtmacdonald
  • Possible duplicate of [Regex Latin characters filter and non latin character filer](http://stackoverflow.com/questions/29948341/regex-latin-characters-filter-and-non-latin-character-filer) – revo Apr 10 '17 at 15:57

3 Answers

The .NET regex engine does not support Unicode scripts, but it does support Unicode blocks. That said, you can match all Latin characters with the regex below:

^[\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{IsLatinExtended-A}\p{IsLatinExtended-B}0-9]+$
  • \p{IsBasicLatin}: U+0000–U+007F
  • \p{IsLatin-1Supplement}: U+0080–U+00FF
  • \p{IsLatinExtended-A}: U+0100–U+017F
  • \p{IsLatinExtended-B}: U+0180–U+024F

or simply use ^[\u0000-\u024F0-9]+$.
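As a minimal sketch of how the block-based pattern above could be used from C#: the class and method names here (`LatinBlockDemo`, `IsLatinBlockText`) are hypothetical, and the separate `0-9` is left out because `\p{IsBasicLatin}` already covers the ASCII digits. Keep in mind that a block is a plain code point range, so it also contains non-letters such as spaces and punctuation.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LatinBlockDemo
{
    // Named Unicode blocks as supported by the .NET regex engine.
    // Each block is a code point range, so IsBasicLatin also admits
    // ASCII digits, spaces, punctuation and control characters.
    private const string Pattern =
        @"^[\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]+$";

    public static bool IsLatinBlockText(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }
}
```

With this sketch, `IsLatinBlockText("\u00C5ngstr\u00F6m")` (Ångström) and `IsLatinBlockText("\u00D8rsted42")` (Ørsted42) should return true, Cyrillic text such as `"\u041F\u0440\u0438\u0432\u0435\u0442"` should return false, and `"a b!"` should return true because the space and `!` belong to the Basic Latin block.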

As @AnthonyFaull mentioned, you may also want to match \p{IsLatinExtendedAdditional}, the named block for U+1E00–U+1EFF, which contains 256 additional characters:

[ắẮằẰẵẴẳẲấẤầẦẫẪẩẨảẢạ ẠặẶậẬḁḀ ẚ ḃḂḅḄḇḆ ḉḈ ḋḊḑḐḍḌḓḒḏḎ ẟ ếẾềỀễỄểỂẽẼḝḜḗḖḕḔẻẺẹẸ ệỆḙḘḛḚ ḟḞ ḡḠ ḧḦḣḢḩḨḥḤḫḪẖ ḯḮỉỈịỊḭḬ ḱḰḳḲḵḴ ḷḶḹḸḽḼḻḺ ỻỺ ḿḾṁṀṃṂ ṅṄṇṆṋṊṉṈ ốỐồỒỗỖổỔṍṌṏṎṓṒṑṐỏỎớỚ ờỜỡỠởỞợỢọỌộỘ ṕṔṗṖ ṙṘṛṚṝṜṟṞ ṥṤṧṦṡṠṣṢṩṨẛ ẞ ẜ ẝ ẗṫṪṭṬṱṰṯṮ ṹṸṻṺủỦứỨừỪữỮửỬựỰụỤṳṲ ṷṶṵṴ ṽṼṿṾ ỽỼ ẃẂẁẀẘẅẄẇẆẉẈ ẍẌẋẊ ỳỲẙỹỸẏẎỷỶỵỴ ỿỾ ẑẐẓẒẕẔ]
revo

Use ^[\p{L}\s]+$ to match any Unicode letter (in any script, not only Latin) or whitespace

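Or ^[\w\u00c0-\u017e]+$ to match any word character plus the Unicode characters from U+00C0 to U+017E (use charmap to find the Unicode range you need). Note that in .NET, \w already matches Unicode letters and digits unless RegexOptions.ECMAScript is set, so the extra range only matters with that option.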

Sample on regex101
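To illustrate how these two patterns behave in .NET specifically, here is a small sketch. The class and method names are hypothetical. It assumes the default .NET behaviour where `\w` is Unicode-aware, and uses `RegexOptions.ECMAScript` to get the ASCII-only `\w` that other regex flavours use.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class WordClassDemo
{
    // \p{L} is the Unicode "Letter" general category: letters of every
    // script (Latin, Cyrillic, Greek, ...), so it is broader than Latin.
    public static bool IsLettersAndSpaces(string input) =>
        Regex.IsMatch(input, @"^[\p{L}\s]+$");

    // By default, \w in .NET matches Unicode letters, digits and
    // connector punctuation, not just [a-zA-Z0-9_].
    public static bool IsWordCharsDefault(string input) =>
        Regex.IsMatch(input, @"^\w+$");

    // With RegexOptions.ECMAScript, \w is restricted to [a-zA-Z0-9_],
    // so the explicit \u00c0-\u017e range is what admits accented letters.
    public static bool IsWordCharsEcma(string input) =>
        Regex.IsMatch(input, @"^[\w\u00c0-\u017e]+$", RegexOptions.ECMAScript);
}
```

For example, a Cyrillic string such as `"\u041F\u0440"` should pass `IsLettersAndSpaces` and `IsWordCharsDefault` but fail `IsWordCharsEcma`, while `"\u00D8rsted"` (Ørsted) should pass all three.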

Stephane Janicaud

I would use the Unicode blocks that cover the Latin script.

As described by Wikipedia (https://en.wikipedia.org/wiki/Latin_script_in_Unicode), I would use the letter range of Latin-1 Supplement (00C0–00FF), Latin Extended-A (0100–017F), Latin Extended-B (0180–024F) and your pattern for ASCII alphanumeric characters.

string pattern = "^[a-zA-Z0-9\\u00C0-\\u024F]+$";
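One caveat with a plain range is that U+00C0–U+024F also contains a few non-letters, for example the multiplication sign × (U+00D7) and the division sign ÷ (U+00F7). If that matters, a possible variant (not part of the answer above) is to use .NET character class subtraction to keep only letters. The names `LatinLetterDemo` and `IsLatinAlphanumeric` are hypothetical, and the sketch assumes the input uses precomposed characters rather than combining marks.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LatinLetterDemo
{
    // [\u0000-\u024F-[\P{L}]] means: code points up to U+024F, minus
    // everything that is NOT a letter (.NET character class subtraction).
    // Digits are added back through a separate alternative, because the
    // subtraction would otherwise remove them as well.
    private const string Pattern = @"^(?:[\u0000-\u024F-[\P{L}]]|[0-9])+$";

    public static bool IsLatinAlphanumeric(string input) =>
        Regex.IsMatch(input, Pattern);
}
```

Under these assumptions, `"\u00E5\u00D8abc123"` (åØabc123) should be accepted, while `"2\u00D73"` (2×3), `"a-b"` and Cyrillic text should be rejected.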
Michael Schmidt