French/Portuguese extended ASCII symbols in regex

Question

I need to write an edit control mask that should accept [a-zA-Z] letters as well as extended French and Portuguese symbols like [ùàçéèçÇµ]. The mask should accept both uppercase and lowercase symbols. I found two suggestions:

[\p{L}]

and

[a-zA-Z0-9\u0080-\u009F]

What is the correct way to write such a regular expression?

Update: My question is about forming a regexp that should match (not filter) French and Portuguese characters so they can be displayed in the edit control. A case-insensitive solution won't help me. [\p{L}] seems to be a Unicode character class, but I need an ASCII regexp. Digits are allowed, but special characters such as !@#$%^&*)_+}{|"?>< are disallowed (should be filtered).

The variant that works best for me is [a-zA-Z0-9\u00B5-\u00FF]

https://regex101.com/r/EPF1rg/2

The question is why the range for [ùàçéèçÇµ] is \u00B5-\u00FF and not \u0080-\u009F. As far as I can see from CP860 (Portuguese code page) and CP863 (French code page), it should be in the range \u0080-\u009F.

https://www.ascii-codes.com/cp860.html

Can anyone explain it?

Possible duplicate of [Regex accent insensitive?](https://stackoverflow.com/questions/6664582/regex-accent-insensitive) — Cee McSharpface, Jul 20 '17 at 09:22
what about the µ at the end of your sample string? I don't think that it would ever equal m even if accent insensitive? — Cee McSharpface, Jul 20 '17 at 09:24
I don't need it to be case insensitive. I need a regexp that matches µ symbol if the user enters it from French layout keyboard. — Vadym Lenda, Jul 20 '17 at 10:37
C# doesn't use ASCII so there are no ASCII regexes. (There is no one thing called extended ASCII so the use of this term is almost always inadequate.) (Keyboard layouts are an issue between the user and the operating system. Perhaps you are taking too much on board the application, perhaps not.) — Tom Blodget, Jul 20 '17 at 10:42
ok based on your last edit, this is no longer a duplicate of the accent sensitivity stuff. the beauty about Unicode is that it no longer maps the same codepoints to different characters, as previously with all the ANSI codepages, that had to share the upper half of the 256 places. [\u00B5 is the µ character](http://www.fileformat.info/info/unicode/char/00b5/index.htm) all right. Unicode does not care about what it was in CP860 or any other legacy codepage. — Cee McSharpface, Jul 20 '17 at 10:45
Try applying [`String.Normalize`](https://msdn.microsoft.com/en-us/library/ebza6ck1(v=vs.110).aspx) before matching. — Tom Blodget, Jul 20 '17 at 10:46

score 1 · Accepted Answer · answered Jul 20 '17 at 11:05

The characters [µùàçéèçÇ] are in the range \u00B5-\u00FF because the Unicode standard assigns them those code points. The "old" range (\u0080-\u009F, as in the Portuguese code page 860) was just one of many possible mappings of the 128 extended characters available in legacy 8-bit code pages, where the same character could sit at different code points depending on the code page. In Unicode, U+0080 through U+009F are C1 control characters, so [\u0080-\u009F] matches none of these letters.
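
To make the difference concrete, here is a minimal sketch that prints each character's Unicode code point next to the byte it occupies in the two legacy code pages. Python is used purely for illustration (the question targets C#); `cp860` and `cp863` are Python's standard codec names, and the helper name `legacy_byte` is made up for this example.

```python
# Compare Unicode code points with the byte values used by the
# legacy DOS code pages CP860 (Portuguese) and CP863 (French).

def legacy_byte(ch: str, codepage: str) -> int:
    """Return the single byte that represents `ch` in the given code page."""
    encoded = ch.encode(codepage)
    return encoded[0]

for ch in "µùàçéèÇ":
    print(
        f"{ch}  Unicode U+{ord(ch):04X}"
        f"  cp860 0x{legacy_byte(ch, 'cp860'):02X}"
        f"  cp863 0x{legacy_byte(ch, 'cp863'):02X}"
    )
```

For example, ç is byte 0x87 in both code pages but U+00E7 in Unicode, and a regex over Unicode strings only ever sees the latter.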

C# strings are Unicode, and so are its regex features: https://stackoverflow.com/a/20641460/1132334

If you really must specify a fixed range of characters, in C# you can just as well include them literally:

[a-zA-Z0-9µùàçéèçÇ]
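
One practical difference between the range and the literal class: \u00B5-\u00FF also covers non-letters in that block, such as × (U+00D7) and ÷ (U+00F7), while the literal class accepts only what is listed. A small sketch, in Python for illustration only; the names `RANGE_MASK`, `LITERAL_MASK` and `accepts` are hypothetical.

```python
import re

# Range-based class from the question: ASCII letters, digits, U+00B5..U+00FF.
RANGE_MASK = re.compile(r"[a-zA-Z0-9\u00B5-\u00FF]+")

# Literal class from the answer: only the characters listed explicitly.
LITERAL_MASK = re.compile(r"[a-zA-Z0-9µùàçéèÇ]+")

def accepts(mask, text: str) -> bool:
    """True if every character of `text` is allowed by `mask`."""
    return mask.fullmatch(text) is not None

print(accepts(RANGE_MASK, "a×b"))    # True: the range includes U+00D7
print(accepts(LITERAL_MASK, "a×b"))  # False: × is not listed
print(accepts(LITERAL_MASK, "São"))  # False: ã is not listed either
```

The last line shows the trade-off: the literal class has to name every accented letter that French and Portuguese input may contain.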

Or, as others have already suggested, use the "letter" class. That way you don't have to define what counts as a letter in each alphabet, and you don't need to keep up with future changes to that definition yourself:

\p{L}
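
Python's built-in `re` module has no \p{L}, so the sketch below approximates it by checking the Unicode general category of each character; this is a stand-in for illustration, not how .NET implements the class. It also applies NFC normalization first, in the spirit of the `String.Normalize` comment above, so that a decomposed é (e followed by U+0301) is treated as one letter. The function name `is_letters_or_digits` is hypothetical.

```python
import unicodedata

def is_letters_or_digits(text: str) -> bool:
    """True if text is non-empty and holds only Unicode letters or ASCII digits."""
    # NFC folds "e" + combining acute (U+0301) into the single letter U+00E9.
    normalized = unicodedata.normalize("NFC", text)
    return bool(normalized) and all(
        unicodedata.category(ch).startswith("L") or ch in "0123456789"
        for ch in normalized
    )
```

Keep in mind that a letter class is broader than French and Portuguese: it also accepts, for example, Greek or Cyrillic letters.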

A third valid option could be to invert the specification and name only the punctuation characters and control characters that you would not allow.
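
A sketch of the inverted approach, under the assumption that "disallowed" means any character whose Unicode general category is punctuation (P*), symbol (S*), or control/other (C*). Symbols are included because several of the characters the question wants filtered, such as $, ^, +, |, < and >, are classified as symbols rather than punctuation. In .NET the corresponding classes would be \p{P}, \p{S} and \p{C}; Python is used here only to illustrate, and `strip_disallowed` is a made-up name.

```python
import unicodedata

# First letter of the Unicode general category: Punctuation, Symbol, Other.
DISALLOWED_MAJOR_CATEGORIES = {"P", "S", "C"}

def strip_disallowed(text: str) -> str:
    """Drop punctuation, symbol and control characters; keep everything else."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in DISALLOWED_MAJOR_CATEGORIES
    )
```

Spaces (category Zs) pass through this sketch unchanged, so add "Z" to the set if the edit control should reject them as well.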
