Is there an easy way to match all punctuation except period and underscore, in a C# regex? Hoping to do it without enumerating every single punctuation mark.
4 Answers
Use character class subtraction
[\p{P}-[._]]
See the .NET Regex documentation. I'm not sure if other flavors support it.
C# example
using System;
using System.Text.RegularExpressions;

// \p{S} added to also catch ^, ~ and ` (among others)
string pattern = @"[\p{P}\p{S}-[._]]";
string test = @"_""'a:;%^&*~`bc!@#.,?";
MatchCollection mx = Regex.Matches(test, pattern);
foreach (Match m in mx)
{
    // Print each matched character with its index and length.
    Console.WriteLine("{0}: {1} {2}", m.Value, m.Index, m.Length);
}
Explanation
The pattern uses character class subtraction. It starts with a standard character class such as [\p{P}] and then appends a subtraction class, -[._], which removes . and _ from the set. The subtraction goes inside the outer [ ], after the rest of the class contents.
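As a rough illustration of the subtraction in use, the sketch below strips every punctuation character except . and _ from a string. The sample input is made up for this example; only the pattern comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Character class subtraction: any Unicode punctuation except '.' and '_'.
string subtractionPattern = @"[\p{P}-[._]]";

// Hypothetical sample input, chosen only for illustration.
string sampleInput = "file_name.v2, (draft)!";

// Remove every match; '.' and '_' survive because they were subtracted.
string cleaned = Regex.Replace(sampleInput, subtractionPattern, "");

Console.WriteLine(cleaned); // file_name.v2 draft
```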
- That didn't seem to match ^, ~ or `; could I be testing it wrong, or does .NET not consider them to be punctuation? – Smashery Oct 20 '10 at 00:50
- If you drop the -[._], then \p{P} doesn't match them either. – Les Oct 20 '10 at 00:57
- So .NET doesn't consider them to be punctuation? – Smashery Oct 20 '10 at 00:58
- I am surprised that the grave accent is not considered punctuation. I suppose you need to define what you mean by punctuation. You can add the "symbol" character class (\p{S}) to pick up the accent, caret and tilde. I will edit my example. – Les Oct 20 '10 at 01:07
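To see why \p{P} skips these characters, one option is to ask .NET for their Unicode general category. This is a small sketch; the '!' is included only as a contrasting character that is punctuation.

```csharp
using System;
using System.Globalization;

// Characters discussed in the comments above, plus '!' for comparison.
foreach (char c in "^~`!")
{
    // \p{P} covers the *Punctuation categories; \p{S} covers the *Symbol ones.
    UnicodeCategory category = char.GetUnicodeCategory(c);
    Console.WriteLine($"{c}: {category}");
}
// ^: ModifierSymbol
// ~: MathSymbol
// `: ModifierSymbol
// !: OtherPunctuation
```

The caret, tilde and grave accent fall under symbol categories, which is why adding \p{S} to the class picks them up.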
The answers so far do not respect ALL punctuation. This should work:
(?![\._])\p{P}
(Explanation: a negative lookahead ensures that neither . nor _ is matched; then \p{P} matches any Unicode punctuation character.)
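A quick way to check that the lookahead form and the subtraction form agree is to run both over the same input and compare the matches. The sketch below reuses the test string from the C# example in the other answer; the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string test = @"_""'a:;%^&*~`bc!@#.,?";

// Negative lookahead: reject '.' and '_' first, then match any punctuation.
string viaLookahead = string.Concat(
    Regex.Matches(test, @"(?![._])\p{P}").Cast<Match>().Select(m => m.Value));

// Character class subtraction: punctuation minus '.' and '_'.
string viaSubtraction = string.Concat(
    Regex.Matches(test, @"[\p{P}-[._]]").Cast<Match>().Select(m => m.Value));

Console.WriteLine(viaLookahead);                   // "':;%&*!@#,?
Console.WriteLine(viaLookahead == viaSubtraction); // True
```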

- That didn't seem to match ^, ~ or `; could I be testing it wrong, or does .NET not consider them to be punctuation? – Smashery Oct 20 '10 at 00:50
- @Smashery These are accents; you would never use them as punctuation in the English language. – steinar Oct 20 '10 at 01:00
- Thanks very much! I decided to accept Les's answer, because I find Regex Subtraction easier to understand conceptually; thus I'm more likely to remember it; but +1 - thanks for teaching me some new things! (Wish I could accept two answers) – Smashery Oct 20 '10 at 01:04
- @Smashery - Even though character class subtraction is easier to understand, be prepared to see this very common construct in regex. The negative lookahead is used a lot, and it may be supported by more regex flavors than subtraction (my guess). – Les Jul 13 '12 at 18:42
Here is something a little simpler: match anything that is not a word character, white-space, or a period (where word characters include A-Za-z0-9 AND underscore).
[^\w\s.]
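One thing to keep in mind: in .NET (without RegexOptions.ECMAScript), \w also covers non-ASCII letters and digits, and the negated class matches symbols as well as punctuation. A small sketch with a made-up sample string:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample; \u00E9 is 'é', a letter that \w matches in .NET.
string sample = "a_b.c \u00E9 $5+3 (ok)!";

// Collect everything that is not a word character, white-space, or '.'.
string matched = string.Concat(
    Regex.Matches(sample, @"[^\w\s.]").Cast<Match>().Select(m => m.Value));

Console.WriteLine(matched); // $+()!
```

Note that $ and + are symbols rather than punctuation, so this pattern is broader than \p{P}.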

You could possibly use a negated character class like this:
[^0-9A-Za-z._\s]
This includes every character except those listed. You may need to exclude more characters (such as control characters), depending on your ultimate requirements.
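To make the caveat concrete, the sketch below compares the ASCII-only negated class with the \p{P} subtraction pattern on a string containing accented letters. The sample string is an assumption made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: "naïve café_1.0" written with escapes for the accents.
string accented = "na\u00EFve caf\u00E9_1.0";

// The ASCII-only negated class treats the accented letters as matches.
int negatedCount = Regex.Matches(accented, @"[^0-9A-Za-z._\s]").Count;

// The Unicode punctuation class (minus '.' and '_') finds nothing here.
int punctuationCount = Regex.Matches(accented, @"[\p{P}-[._]]").Count;

Console.WriteLine(negatedCount);     // 2
Console.WriteLine(punctuationCount); // 0
```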

- Would work on a limited set, but a lot of printable characters (currency symbols, mathematical symbols, diacritics etc.) are going to match this. – Wrikken Oct 20 '10 at 00:02
- How about `º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ` etc. (you get the idea)? – Lucero Oct 20 '10 at 00:02
- Can anyone explain why the full stop does not need escaping in this? Why isn't the full stop excluding every character? It doesn't - this works as described - I just don't understand the logic. This also seems to work as described if you do escape the full stop. I.e., `Regex("[^a-zA-Z0-9\\.]").Replace("a_b:c-d.e 4\\5&6%c£7.","_")` returns `"a_b_c_d.e_4_5_6_c_7."`, as does `Regex("[^a-zA-Z0-9.]")`. Better still, does anyone have a decent RTFM link? – Chris Apr 13 '15 at 14:29
- @Chris: The full stop does not need escaping there because full stop has no special meaning when inside `[]` brackets. For convenience, most regex parsers will allow you to escape it there anyway with no change in meaning. – Greg Hewgill Apr 13 '15 at 18:07
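The point about the full stop can be checked directly. This sketch reruns the input from Chris's comment with both the escaped and unescaped forms of the class; the £ sign is written as \u00A3 to avoid source-encoding issues.

```csharp
using System;
using System.Text.RegularExpressions;

// Input from the comment above: a_b:c-d.e 4\5&6%c£7.
string input = "a_b:c-d.e 4\\5&6%c\u00A37.";

// Inside [...] the dot is a literal, so escaping it changes nothing.
string escaped = Regex.Replace(input, @"[^a-zA-Z0-9\.]", "_");
string unescaped = Regex.Replace(input, @"[^a-zA-Z0-9.]", "_");

Console.WriteLine(escaped);              // a_b_c_d.e_4_5_6_c_7.
Console.WriteLine(escaped == unescaped); // True

// Outside a class, an unescaped dot matches any character except newline.
Console.WriteLine(Regex.IsMatch("x", "."));   // True
Console.WriteLine(Regex.IsMatch("x", "[.]")); // False
```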