Remove all exclusive Latin characters using regex

Question

I'm developing software in Portuguese, so many of my entities have names like 'maça' or 'lição', and I want to use the entity name as a resource key. So I want to keep every character except 'ç', 'ã', 'õ' and so on.

Is there an optimal solution using regex? My current regex is (as suggested in "Remove characters using Regex"):

Regex regex = new Regex(@"[\W_]+");
string cleanText = regex.Replace(messyText, "").ToUpper();

Just to emphasize: I'm only concerned with Latin characters.

The title says "remove all latin characters", is that correct? What about `"abçã12#$%"`? – Kobi Mar 16 '11 at 19:35
Just a correction of your terminology: All these Portuguese characters are actually part of the Latin character set. Your regex should work - what exactly is your question? – Mauritz Hansen Mar 16 '11 at 19:48
@Hansen, in my regex some characters like 'Ç' are not replaced – Custodio Mar 16 '11 at 19:50
Just a note - `\w` actually matches all Unicode letters, digits and underscores, not just the common ones. Same thing for `\d` - it matches all Unicode digits, including `٠١٢`, for example. `\W` and `\D` are the complements, of course, so they exclude all Unicode letters and digits respectively. That is why `\W` keeps `Ç` in your regex. – Kobi Mar 16 '11 at 20:10
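
To illustrate the comment above, here is a minimal sketch of how the question's pattern behaves with and without `RegexOptions.ECMAScript`. The class and method names are hypothetical and only serve the illustration; the sample input assumes precomposed (NFC) characters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original question.
public static class WordClassDemo
{
    // Default .NET behaviour: \W is Unicode-aware, so accented letters
    // count as word characters and survive the replacement.
    public static string StripUnicodeAware(string text) =>
        Regex.Replace(text, @"[\W_]+", "");

    // With RegexOptions.ECMAScript, \w is limited to [a-zA-Z0-9_],
    // so \W also matches accented letters and they are removed.
    public static string StripAsciiOnly(string text) =>
        Regex.Replace(text, @"[\W_]+", "", RegexOptions.ECMAScript);
}
```

With the input `"Lição_1 Ç"`, `StripUnicodeAware` should return `"Lição1Ç"` while `StripAsciiOnly` should return `"Lio1"`.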

score 7 · Accepted Answer · edited Oct 15 '21 at 09:54

A simple option is to white-list the accepted characters:

string clean = Regex.Replace(messy, @"[^a-zA-Z0-9!@#]+", "");

If you want to remove all non-ASCII letters but keep all other characters, you can use character class subtraction:

string clean = Regex.Replace(messy, @"[\p{L}-[a-zA-Z]]+", "");

It can also be written as the more widely supported but harder to read [^\P{L}a-zA-Z]+, which reads "match every character that is neither a non-letter nor an ASCII letter" and so ends up with exactly the letters we're looking for.
Some context for \W: it stands for "not a word character". In .NET, \w is Unicode-aware by default, so \W does not match accented letters such as Ç; it only means "anything other than a-z, A-Z, 0-9 and underscore" when RegexOptions.ECMAScript is set.
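
As a rough sketch, the two spellings of "non-ASCII letters" can be compared side by side. The class and method names are made up for this example, and the sample input assumes precomposed (NFC) characters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class illustrating the two equivalent patterns.
public static class NonAsciiLetters
{
    // Character class subtraction: all Unicode letters minus a-z and A-Z.
    public static string WithSubtraction(string text) =>
        Regex.Replace(text, @"[\p{L}-[a-zA-Z]]+", "");

    // Negated class: anything that is neither a non-letter (\P{L})
    // nor an ASCII letter, i.e. the same set of non-ASCII letters.
    public static string WithNegatedClass(string text) =>
        Regex.Replace(text, @"[^\P{L}a-zA-Z]+", "");
}
```

For `"lição 12#"` both methods should return `"lio 12#"`: the accented letters are removed, while digits, spaces and punctuation are left alone.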

You may also consider the following approach more useful: How do I remove diacritics (accents) from a string in .NET?
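
For reference, a common way to strip diacritics in .NET is to decompose the string and drop the combining marks. The sketch below is one possible version of that technique, not necessarily the exact code from the linked question; `DiacriticsDemo` and `RemoveDiacritics` are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper class for illustration.
public static class DiacriticsDemo
{
    public static string RemoveDiacritics(string text)
    {
        // FormD splits 'ç' into 'c' plus a combining cedilla, and so on.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Unlike the regex approaches, this keeps the base letter, so `"lição"` should become `"licao"` rather than `"lio"`, which may make for more readable resource keys.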

answered Mar 16 '11 at 19:43 by Kobi · edited Oct 15 '21 at 09:54 by George Dimitriadis

+1 because I'd never seen character class subtraction before. Holy smoke, that's useful. Is this only in .NET? – Justin Morgan - On strike Mar 16 '11 at 19:49
This is what I'm thinking @Kobi. The idea of all characters minus the Latin exclusive. – Custodio Mar 16 '11 at 19:52
@Justin - Thanks! It isn't .Net only (I've seen it elsewhere, IIRC, though I can't get it to work anywhere at the moment, so I may be wrong here), and it isn't so useful at all - this is the first time I ever considered using it. You could probably write it in another way with an alternation, or something like `(?![a-zA-Z])\p{L}` (I'm probably missing the obvious option here...) – Kobi Mar 16 '11 at 19:54
@Justin - Update - according to http://www.regular-expressions.info/refflavors.html , it is mostly an XML feature, and isn't quite as common as I thought. – Kobi Mar 16 '11 at 20:05
The second suggestion doesn't remove spaces, but that wasn't clear in the question. Success. – Custodio Mar 16 '11 at 20:17
@Luís - Thanks! It should be very easy to tweak the regex to add spaces - just adding `\s` at the right place `:)` – Kobi Mar 16 '11 at 20:35

score 5 · Answer 2 · edited Dec 18 '18 at 15:55

Another option might be to convert from Unicode to ASCII. This will not drop characters, but convert them to ?s. That might be better than dropping them (for use as keys).

string suspect = "lição";
byte[] suspectBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, Encoding.Unicode.GetBytes(suspect));
string purged = Encoding.ASCII.GetString(suspectBytes);
Console.WriteLine(purged); // li??o

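Note that every unrepresentable character becomes the same literal ?, so distinct names can still collide. Because each ? keeps the position of the character it replaces, though, you may get fewer collisions than when the characters are dropped entirely.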

score 4 · Answer 3 · answered Mar 16 '11 at 19:41

Does this work?

Regex regex = new Regex(@"[^a-zA-Z0-9_]");

answered Mar 16 '11 at 19:41 by Chris Haas

score 2 · Answer 4 · answered Mar 16 '11 at 19:53

I think the best regex would be to use:

[^\x00-\x7F]

This is the negation of all ASCII characters, so it matches all non-ASCII characters: \x00 and \x7F (127) are hexadecimal character codes, and - means range. The ^ inside the [ and ] means negation.

Replace them with the empty string, and you should have what you want. It also frees you from worrying about punctuation, and the like, that are not ASCII, and can cause subtle but annoying (and hard to track down) errors.

If you want to use the extended ASCII set as legal characters, you can say \xFF instead of \x7F.

answered Mar 16 '11 at 19:53 by Ezra

But if we're trying to keep the code as legible as possible, the \x escapes may be a pain point for future maintenance. – Custodio Mar 16 '11 at 20:09
@Luís - Consider adding a friendly comment, in that case, with a link here `:)` – Kobi Mar 16 '11 at 20:12
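
One way to address the legibility concern is to hide the pattern behind a named, commented field. This is only a sketch: `AsciiFilter` and its members are hypothetical names, and it uses \x7F (127) as the upper bound because 7-bit ASCII ends there. The second method uses the .NET named block `IsBasicLatin`, which covers the same U+0000 to U+007F range and may read more clearly than hex escapes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration.
public static class AsciiFilter
{
    // Matches any run of characters outside 7-bit ASCII (U+0000..U+007F).
    private static readonly Regex NonAscii = new Regex(@"[^\x00-\x7F]+");

    // Same range, spelled with the Unicode named block for Basic Latin.
    private static readonly Regex NonBasicLatin = new Regex(@"\P{IsBasicLatin}+");

    public static string RemoveNonAscii(string text) =>
        NonAscii.Replace(text, "");

    public static string RemoveNonBasicLatin(string text) =>
        NonBasicLatin.Replace(text, "");
}
```

Calling either method on `"lição, ok!"` should give `"lio, ok!"`.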

score 2 · Answer 5 · answered Mar 06 '12 at 13:12

The goal should be to simply include ASCII characters A-Z and numbers and punctuation. Just exclude everything outside of that range using RegEx.

string clean = Regex.Replace(messy, @"[^\x20-\x7e]", String.Empty);

To be clear, the regex I'm using is:

[^\x20-\x7e]

You may need to escape the \ character in a regular (non-verbatim) string literal - I haven't tested this in anything but RegexBuddy :)

That excludes everything outside the range 0x20 to 0x7e, which translates to decimal 32-126: the printable ASCII characters.

Good luck!

Best,

-Auri

score 0 · Answer 6 · answered Jul 22 '13 at 14:47

This is more useful to me:

([\p{L}]+)

answered Jul 22 '13 at 14:47 by Marcelo Rodovalho
