10

I'm developing software in Portuguese, so many of my entities have names like 'maça' or 'lição', and I want to use the entity name as a resource key. So I want to keep every character except 'ç', 'ã', 'õ', and so on.

Is there an optimal solution using regex? My current regex is (as "Remove characters using Regex" suggests):

Regex regex = new Regex(@"[\W_]+");
string cleanText = regex.Replace(messyText, "").ToUpper();

Just to emphasize: I'm only concerned with Latin characters.

Custodio
  • The title says "remove all latin characters", is that correct? What about `"abçã12#$%"`? – Kobi Mar 16 '11 at 19:35
  • What about "abc", all Latin characters. – Tergiver Mar 16 '11 at 19:40
  • my bad @Kobi, I changed the title – Custodio Mar 16 '11 at 19:44
  • Just a correction of your terminology: All these Portuguese characters are actually part of the Latin character set. Your regex should work - what exactly is your question? – Mauritz Hansen Mar 16 '11 at 19:48
  • @Hansen, in my regex some characters like 'Ç' are not replaced – Custodio Mar 16 '11 at 19:50
  • Just a note - `\w` actually matches all Unicode letters, digits and underscores, not just the common ones. Same thing for `\d` - it matches all Unicode digits, including `٠١٢`, for example. `\W` and `\D` act the same, of course, and exclude all Unicode characters. That is why `\W` keeps `Ç` in your regex. – Kobi Mar 16 '11 at 20:10

6 Answers

7

A simple option is to white-list the accepted characters:

string clean = Regex.Replace(messy, @"[^a-zA-Z0-9!@#]+", "");

If you want to remove all non-ASCII letters but keep all other characters, you can use character class subtraction:

string clean = Regex.Replace(messy, @"[\p{L}-[a-zA-Z]]+", "");

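It can also be written as the more standard but more complicated [^\P{L}a-zA-Z]+ (or with \W in place of \P{L}, though that variant also matches digits and the underscore), which reads "select all characters that are neither non-letters nor ASCII letters", leaving exactly the letters we're looking for.
Some context for \W: it stands for "not a word character". In .NET, \w is Unicode-aware by default, so \W excludes every Unicode letter and digit plus the underscore, which is why it leaves 'Ç' alone. It only means "anything other than a-z, A-Z, 0-9 and underscore _" when RegexOptions.ECMAScript is set.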
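To make the difference between these patterns concrete, here is a minimal sketch that runs each one over the same input. The class and method names (KeyCleaner, WhiteList, and so on) are hypothetical and only for illustration; the third method is the question's own pattern with RegexOptions.ECMAScript added, which is one possible way to make \W ASCII-only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeyCleaner
{
    // Keeps only the explicitly listed characters.
    public static string WhiteList(string s) =>
        Regex.Replace(s, @"[^a-zA-Z0-9!@#]+", "");

    // Removes letters outside a-z/A-Z; every other character is kept.
    public static string RemoveNonAsciiLetters(string s) =>
        Regex.Replace(s, @"[\p{L}-[a-zA-Z]]+", "");

    // The question's pattern, with \W restricted to ASCII word characters.
    public static string AsciiWordCharsOnly(string s) =>
        Regex.Replace(s, @"[\W_]+", "", RegexOptions.ECMAScript);

    public static void Main()
    {
        string messy = "li\u00E7\u00E3o #1"; // "lição #1", escaped to avoid source-encoding issues
        Console.WriteLine(WhiteList(messy));             // lio#1
        Console.WriteLine(RemoveNonAsciiLetters(messy)); // lio #1
        Console.WriteLine(AsciiWordCharsOnly(messy));    // lio1
    }
}
```

Only the subtraction version keeps the space and the '#', so the choice depends on which non-letter characters are acceptable in a resource key.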

You may also consider the following approach more useful: How do I remove diacritics (accents) from a string in .NET?
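For reference, a common way to remove diacritics in .NET is to decompose the string and drop the combining marks. The following is a minimal sketch of that technique, not necessarily the exact code from the linked question; DiacriticStripper and RemoveDiacritics are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticStripper
{
    public static string RemoveDiacritics(string text)
    {
        // FormD splits "ç" into "c" plus a combining cedilla, "ã" into "a" plus a combining tilde.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Drop the combining marks and keep the base characters.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveDiacritics("li\u00E7\u00E3o").ToUpperInvariant()); // LICAO
        Console.WriteLine(RemoveDiacritics("ma\u00E7\u00E3").ToUpperInvariant());  // MACA
    }
}
```

Compared with deleting the accented letters, this keeps keys readable ("LICAO" rather than "LIO"), although names that differ only by accents still map to the same key.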

George Dimitriadis
Kobi
  • +1 because I'd never seen character class subtraction before. Holy smoke, that's useful. Is this only in .NET? – Justin Morgan - On strike Mar 16 '11 at 19:49
  • This is what I'm thinking @Kobi. The idea of all characters minus the Latin exclusive. – Custodio Mar 16 '11 at 19:52
  • @Justin - Thanks! It isn't .Net only (I've seen it elsewhere, IIRC, though I can't get it to work anywhere at the moment, so I may be wrong here), and it isn't so useful at all - this is the first time I ever considered using it. You could probably write it in another way with an alternation, or something like `(?![a-zA-Z])\p{L}` (I'm probably missing the obvious option here...) – Kobi Mar 16 '11 at 19:54
  • @Justin - Update - according to http://www.regular-expressions.info/refflavors.html , it is mostly an XML feature, and isn't quite as common as I thought. – Kobi Mar 16 '11 at 20:05
  • The second suggestion doesn't remove spaces, but that wasn't clear in the question. Success. – Custodio Mar 16 '11 at 20:17
  • @Luís - Thanks! It should be very easy to tweak the regex to add spaces - just adding `\s` at the right place `:)` – Kobi Mar 16 '11 at 20:35
5

Another option might be to convert from Unicode to ASCII. This will not drop characters, but convert them to ?s. That might be better than dropping them (for use as keys).

string suspect = "lição";
byte[] suspectBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, Encoding.Unicode.GetBytes(suspect));
string purged = Encoding.ASCII.GetString(suspectBytes);
Console.WriteLine(purged); // li??o

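Note that every unrepresentable character becomes the same literal ?, so two names can still collide if they differ only in which accented letters they use. Because the placeholder keeps its position, though, you should get fewer collisions than when the characters are dropped outright.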
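If a '?' is not acceptable in a resource key, the placeholder can be changed with an encoder fallback. This is a minimal sketch under that assumption; AsciiFallback and ToAscii are hypothetical names, and the second input is a made-up string used only to show a collision.

```csharp
using System;
using System.Text;

public static class AsciiFallback
{
    // Hypothetical helper: replaces each non-ASCII character with `placeholder`.
    public static string ToAscii(string text, string placeholder)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(placeholder),
            new DecoderExceptionFallback());
        return ascii.GetString(ascii.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(ToAscii("li\u00E7\u00E3o", "_")); // li__o
        // Different accented letters in the same positions give the same key.
        Console.WriteLine(ToAscii("li\u00E7\u00F5o", "_")); // li__o
    }
}
```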

Liam
Tergiver
4

Does this work?

Regex regex = new Regex(@"[^a-zA-Z0-9_]");
Chris Haas
2

I think the best regex would be to use:

[^\x00-\x7F]

This is the negation of all ASCII characters, so it matches every non-ASCII character. \x00 and \x7F (127) are hexadecimal character codes, - denotes a range, and ^ inside [ and ] means negation.

Replace the matches with the empty string, and you should have what you want. It also frees you from worrying about non-ASCII punctuation and the like, which can cause subtle but annoying (and hard to track down) errors.

If you want to treat the extended ASCII set as legal characters, use \xFF instead of \x7F.
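As a rough illustration, the sketch below wraps the pattern in a named, commented field so the hex escapes are easier to maintain. It assumes \x7F as the upper bound of 7-bit ASCII; NonAsciiRemover and Strip are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonAsciiRemover
{
    // Matches any run of characters outside the 7-bit ASCII range U+0000..U+007F.
    private static readonly Regex NonAscii = new Regex(@"[^\x00-\x7F]+");

    public static string Strip(string s) => NonAscii.Replace(s, "");

    public static void Main()
    {
        // Input: "lição", a space, an en dash (U+2013), a space, "ok".
        // The accented letters and the dash are removed; both spaces stay.
        Console.WriteLine(Strip("li\u00E7\u00E3o \u2013 ok")); // "lio  ok"
    }
}
```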

Ezra
  • But when trying to keep the code as legible as possible, the hex escapes may be a pain point for future maintenance. – Custodio Mar 16 '11 at 20:09
  • @Luís - Consider adding a friendly comment, in that case, with a link here `:)` – Kobi Mar 16 '11 at 20:12
2

The goal should be to simply include ASCII characters A-Z and numbers and punctuation. Just exclude everything outside of that range using RegEx.

string clean = Regex.Replace(messy, @"[^\x20-\x7e]", String.Empty);

To be clear, the regex I'm using is:

[^\x20-\x7e]

You may need to escape the \ character - I haven't tested this in anything but RegEx buddy :)

That excludes everything outside the ASCII characters 0x20 through 0x7e, which is the decimal range 32-126 (the printable ASCII characters).
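One consequence worth seeing in a quick sketch: because the range starts at 0x20, control characters such as tab and newline are removed along with the accented letters. PrintableAscii and Keep are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrintableAscii
{
    public static string Keep(string s) =>
        Regex.Replace(s, @"[^\x20-\x7e]", String.Empty);

    public static void Main()
    {
        // Tab (\t) and newline (\n) are below 0x20, so they are removed too.
        Console.WriteLine(Keep("li\u00E7\u00E3o\t#1\n")); // lio#1
    }
}
```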

Good luck!

Best,

-Auri

Auri Rahimzadeh
0

This is more useful to me:

([\p{L}]+)
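Note that this pattern matches runs of letters, accented ones included, so it extracts words rather than removing non-ASCII characters. A minimal sketch of using it that way follows; LetterRuns and Extract are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LetterRuns
{
    // Returns every maximal run of Unicode letters in the input.
    public static string[] Extract(string s) =>
        Regex.Matches(s, @"\p{L}+").Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Yields "lição" and "maçã"; digits, spaces and the comma are skipped.
        foreach (string word in Extract("li\u00E7\u00E3o 1, ma\u00E7\u00E3 2"))
            Console.WriteLine(word);
    }
}
```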
Marcelo Rodovalho