35

Is there an easy way to match all punctuation except period and underscore, in a C# regex? Hoping to do it without enumerating every single punctuation mark.

Abe Miessler
Smashery

4 Answers

39

Use Regex Subtraction

[\p{P}-[._]]

See the .NET Regex documentation. I'm not sure if other flavors support it.

C# example

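using System;
using System.Text.RegularExpressions;

// \p{P} = punctuation; \p{S} = symbols, added to get ^, ~ and ` (among others); -[._] removes . and _
string pattern = @"[\p{P}\p{S}-[._]]";
string test = @"_""'a:;%^&*~`bc!@#.,?";
MatchCollection mx = Regex.Matches(test, pattern);
foreach (Match m in mx)
{
    // Print each matched character with its index and length
    Console.WriteLine("{0}: {1} {2}", m.Value, m.Index, m.Length);
}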

Explanation

The pattern uses character class subtraction. It starts with a standard character class such as [\p{P}] and then appends a subtraction class such as -[._], which removes . and _ from the set. The subtraction goes inside the outer [ ], after the contents of the base class.

Michael
Les
  • That didn't seem to match ^, ~ or `; could I be testing it wrong, or does .NET not consider them to be punctuation? – Smashery Oct 20 '10 at 00:50
  • If you drop the -[._], then \p{P} doesn't match them either. – Les Oct 20 '10 at 00:57
  • So .NET doesn't consider them to be punctuation? – Smashery Oct 20 '10 at 00:58
  • 3
    I am surprised that the grave accent is not considered punctuation. I suppose you need to define what you mean by punctuation. You can add the "symbol" character class (\p{S}) to pick up the accent, caret and tilde. I will edit my example. – Les Oct 20 '10 at 01:07
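
To see why \p{P} skips those three characters, it can help to print the Unicode category .NET assigns to each one. The following is only a small sketch; the `samples` array is an illustrative choice, not part of the original answer.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Characters discussed in the comments, plus a few ordinary punctuation marks for comparison.
char[] samples = { '^', '~', '`', '!', '_', '.' };

foreach (char c in samples)
{
    // The general category decides whether \p{P} or \p{S} applies.
    UnicodeCategory category = char.GetUnicodeCategory(c);
    bool isPunctuation = Regex.IsMatch(c.ToString(), @"\p{P}");
    bool isSymbol = Regex.IsMatch(c.ToString(), @"\p{S}");
    Console.WriteLine("{0}  {1}  P={2}  S={3}", c, category, isPunctuation, isSymbol);
}
```

This reports ^ and ` as ModifierSymbol and ~ as MathSymbol, so they fall under \p{S} rather than \p{P}, while !, _ and . are all punctuation categories.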
19

The answers so far do not respect ALL punctuation. This should work:

(?![\._])\p{P}

(Explanation: Negative lookahead to ensure that neither . nor _ are matched, then match any unicode punctuation character.)
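
A minimal sketch of the lookahead in use, reusing the test string from Les's answer and comparing the result with the subtraction pattern. The variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = @"_""'a:;%^&*~`bc!@#.,?";

// (?![._]) fails wherever the next character is . or _,
// so \p{P} only consumes the remaining punctuation.
string lookahead = @"(?![._])\p{P}";
string subtraction = @"[\p{P}-[._]]";

string viaLookahead = string.Concat(
    Regex.Matches(input, lookahead).Cast<Match>().Select(m => m.Value));
string viaSubtraction = string.Concat(
    Regex.Matches(input, subtraction).Cast<Match>().Select(m => m.Value));

Console.WriteLine(viaLookahead);                    // "':;%&*!@#,?
Console.WriteLine(viaLookahead == viaSubtraction);  // True
```

On this input the two patterns select the same characters; neither picks up ^, ~ or `, which .NET classifies as symbols.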

Lucero
  • That didn't seem to match ^, ~ or `; could I be testing it wrong, or does .NET not consider them to be punctuation? – Smashery Oct 20 '10 at 00:50
  • @Smashery These are accents; you would never use them as punctuation in the English language. – steinar Oct 20 '10 at 01:00
  • Thanks very much! I decided to accept Les's answer, because I find Regex Subtraction easier to understand conceptually; thus I'm more likely to remember it; but +1 - thanks for teaching me some new things! (Wish I could accept two answers) – Smashery Oct 20 '10 at 01:04
  • 1
    @Smashery - Even though character class subtraction is easier to understand, be prepared to see this very common construct in regex. The negative lookahead is used a lot, and it may be supported by more regex flavors than subtraction (my guess). – Les Jul 13 '12 at 18:42
9

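Here is something a little simpler: match anything that is not a word character, white-space, or a period. Word characters include A-Za-z0-9 and underscore; note that in .NET, \w also matches other Unicode letters and digits unless RegexOptions.ECMAScript is set.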

[^\w\s.]
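
As a rough illustration of what this negated class selects, here is a short sketch with a made-up input string. Because the class is defined by exclusion, it also matches symbols such as ^, ~, ` and +, not only punctuation.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// \u00E9 is an accented e; \w covers it in .NET, so it is not matched.
string input = "a_b.c d-e!f^g~h`i+\u00E9";
string pattern = @"[^\w\s.]";

string matched = string.Concat(
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value));

Console.WriteLine(matched);   // -!^~`+
```

The underscore is skipped because \w includes it, and the period is skipped because it is listed explicitly.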
Ken Richards
1

You could possibly use a negated character class like this:

[^0-9A-Za-z._\s]

This includes every character except those listed. You may need to exclude more characters (such as control characters), depending on your ultimate requirements.
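
To make that caveat concrete, the sketch below runs the negated class over a made-up string containing accented letters, a currency symbol and a fraction, and compares it with the Unicode-aware subtraction pattern from another answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// "résumé €5 ½_x.y!" written with escapes to avoid source-encoding issues.
string input = "r\u00E9sum\u00E9 \u20AC5 \u00BD_x.y!";

string negated = @"[^0-9A-Za-z._\s]";
string subtraction = @"[\p{P}-[._]]";

string viaNegated = string.Concat(
    Regex.Matches(input, negated).Cast<Match>().Select(m => m.Value));
string viaSubtraction = string.Concat(
    Regex.Matches(input, subtraction).Cast<Match>().Select(m => m.Value));

Console.WriteLine(viaNegated);      // the two accented letters, the euro sign, the fraction, and !
Console.WriteLine(viaSubtraction);  // !
```

The negated class treats every non-ASCII letter and symbol as a match, so it is only a good fit when the input is known to be plain ASCII text.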

Abe Miessler
Greg Hewgill
  • Okay, add space to the exclusion list. – Greg Hewgill Oct 19 '10 at 23:39
  • 4
    Would work on a limited set, but a lot of printable characters (currency symbols, mathematical symbols, diacritics etc.) are going to match this. – Wrikken Oct 20 '10 at 00:02
  • 7
    How about `º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ` etc. (you get the idea)? – Lucero Oct 20 '10 at 00:02
  • Can anyone explain why the full stop does not need escaping in this? Why isn't the full stop excluding every character? It doesn't (this works as described); I just don't understand the logic. It also seems to work as described if you do escape the full stop, i.e. `Regex("[^a-zA-Z0-9\\.]").Replace("a_b:c-d.e 4\\5&6%c£7.","_")` returns `"a_b_c_d.e_4_5_6_c_7."`, as does `Regex("[^a-zA-Z0-9.]")`. Better still, does anyone have a decent RTFM link? – Chris Apr 13 '15 at 14:29
  • 1
    @Chris: The full stop does not need escaping there because full stop has no special meaning when inside `[]` brackets. For convenience, most regex parsers will allow you to escape it there anyway with no change in meaning. – Greg Hewgill Apr 13 '15 at 18:07
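
The behaviour discussed in these last two comments can be checked with a short sketch. It reuses Chris's input and adds a contrast case showing the full stop outside a character class; the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Chris's input; \u00A3 is the pound sign.
string input = "a_b:c-d.e 4\\5&6%c\u00A37.";

// Inside [...] the full stop is a literal, escaped or not.
string escaped = new Regex("[^a-zA-Z0-9\\.]").Replace(input, "_");
string unescaped = new Regex("[^a-zA-Z0-9.]").Replace(input, "_");

Console.WriteLine(escaped);               // a_b_c_d.e_4_5_6_c_7.
Console.WriteLine(escaped == unescaped);  // True

// Outside a class, an unescaped full stop matches any character except a newline.
string anyChar = Regex.Replace("a.b", ".", "_");
string literalDot = Regex.Replace("a.b", @"\.", "_");

Console.WriteLine(anyChar);     // ___
Console.WriteLine(literalDot);  // a_b
```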