Regex: Match any punctuation character except . and _

Question

Is there an easy way to match all punctuation except period and underscore, in a C# regex? Hoping to do it without enumerating every single punctuation mark.

score 39 · Accepted Answer · edited Dec 19 '22 at 22:56

Use Regex Subtraction

[\p{P}-[._]]

See the .NET Regex documentation. I'm not sure if other flavors support it.

C# example

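using System;
using System.Text.RegularExpressions;

// \p{S} is added to the base class to also match ^, ~ and ` (among others)
string pattern = @"[\p{P}\p{S}-[._]]";
string test = @"_""'a:;%^&*~`bc!@#.,?";
MatchCollection mx = Regex.Matches(test, pattern);
foreach (Match m in mx)
{
    // Print each matched character with its index and length
    Console.WriteLine("{0}: {1} {2}", m.Value, m.Index, m.Length);
}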

Explanation

The pattern is a character class subtraction. It starts with a standard character class like [\p{P}] and then adds a subtracted character class like -[._], which removes . and _ from the set. The subtraction goes inside the outer [ ], after the contents of the base class.
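
As a rough illustration of the subtraction described above, the sketch below runs the plain class and the symbol-extended class over a short sample string. The class name SubtractionDemo, the helper Collect, and the sample input are made up for this example and are not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SubtractionDemo
{
    // Hypothetical helper: concatenates every match of `pattern` found in `input`.
    public static string Collect(string pattern, string input) =>
        string.Concat(Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value));

    public static void Main()
    {
        string input = "a.b_c,d!e^f~g";

        // Punctuation minus '.' and '_'; '^' and '~' are symbols, so \p{P} skips them.
        Console.WriteLine(Collect(@"[\p{P}-[._]]", input));       // ,!

        // Adding \p{S} to the base class also picks up '^' and '~'.
        Console.WriteLine(Collect(@"[\p{P}\p{S}-[._]]", input));  // ,!^~
    }
}
```

Only the base class changes between the two calls; the subtracted part -[._] stays the same.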

edited Dec 19 '22 at 22:56

Michael

answered Oct 20 '10 at 00:17

Les

That didn't seem to match ^, ~ or `; could I be testing it wrong, or does .NET not consider them to be punctuation? – Smashery Oct 20 '10 at 00:50
If you drop the -[._], then \p{P} doesn't match them either. – Les Oct 20 '10 at 00:57
So .NET doesn't consider them to be punctuation? – Smashery Oct 20 '10 at 00:58
I am surprised that the grave accent is not considered punctuation. I suppose you need to define what you mean by punctuation. You can add the "symbol" character class (\p{S}) to pick up the accent, caret and tilde. I will edit my example. – Les Oct 20 '10 at 01:07

score 19 · Answer 2 · answered Oct 19 '10 at 23:57

The answers so far do not respect ALL punctuation. This should work:

(?![\._])\p{P}

(Explanation: A negative lookahead ensures that neither . nor _ is matched; then \p{P} matches any Unicode punctuation character.)
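
A minimal sketch of the lookahead pattern in use, assuming the goal is to strip the matched punctuation from a string. The class name LookaheadDemo, the helper StripPunctuation, and the sample input are hypothetical and not taken from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Hypothetical helper: removes every punctuation character except '.' and '_'.
    public static string StripPunctuation(string input) =>
        Regex.Replace(input, @"(?![\._])\p{P}", "");

    public static void Main()
    {
        // ',', '(', ')' and '!' are removed; '.' and '_' are kept.
        Console.WriteLine(StripPunctuation("file_name.txt, (draft)!"));  // file_name.txt draft
    }
}
```

The lookahead consumes no characters: it only rejects positions where the next character is . or _, and \p{P} then matches a single character at the positions that remain.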

answered Oct 19 '10 at 23:57

Lucero

That didn't seem to match ^, ~ or `; could I be testing it wrong, or does .NET not consider them to be punctuation? – Smashery Oct 20 '10 at 00:50
@Smashery These are accents; you would never use those as punctuation in the English language. – steinar Oct 20 '10 at 01:00
Thanks very much! I decided to accept Les's answer, because I find Regex Subtraction easier to understand conceptually; thus I'm more likely to remember it; but +1 - thanks for teaching me some new things! (Wish I could accept two answers) – Smashery Oct 20 '10 at 01:04
@Smashery - Even though character class subtraction is easier to understand, be prepared to see this very common construct in regexes. The negative lookahead is used a lot, and it may be supported by more regex flavors than subtraction (my guess). – Les Jul 13 '12 at 18:42

score 9 · Answer 3 · answered Oct 19 '10 at 23:43

Here is something a little simpler: match anything that is not a word character, whitespace, or a period (where word characters include A-Za-z0-9 AND underscore).

[^\w\s.]
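
A small sketch of what this negated class matches, under the assumption that the default .NET options are used (no RegexOptions.ECMAScript), in which case \w also covers non-ASCII letters and digits. The class name NegatedWordClassDemo, the helper Collect, and the sample input are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NegatedWordClassDemo
{
    // Hypothetical helper: concatenates every character matched by [^\w\s.].
    public static string Collect(string input) =>
        string.Concat(Regex.Matches(input, @"[^\w\s.]").Cast<Match>().Select(m => m.Value));

    public static void Main()
    {
        // '.', '_', letters, digits and the space are skipped;
        // symbols such as '^', '~' and '$' are matched along with ',' and '!'.
        Console.WriteLine(Collect("a.b_c,d!e^f~g $5"));  // ,!^~$
    }
}
```

Because the class is defined by exclusion, it matches anything that is not a word character, whitespace, or a period, which goes beyond the \p{P} category.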

answered Oct 19 '10 at 23:43

Ken Richards

score 1 · Answer 4 · edited Oct 19 '10 at 23:41

You could possibly use a negated character class like this:

[^0-9A-Za-z._\s]

This matches every character except those listed. You may need to exclude more characters (such as control characters), depending on your ultimate requirements.
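
To make the trade-off concrete, here is a hedged sketch comparing this ASCII-based negated class with the Unicode punctuation class on a string containing a non-ASCII letter. The class name AsciiNegationDemo, the helper Collect, and the sample input are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AsciiNegationDemo
{
    // Hypothetical helper: concatenates every match of `pattern` found in `input`.
    public static string Collect(string pattern, string input) =>
        string.Concat(Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value));

    public static void Main()
    {
        // "café, ok!" with the accented letter written as an escape (U+00E9).
        string input = "caf\u00E9, ok!";

        // The negated ASCII class matches the accented letter as well as ',' and '!'.
        string negated = Collect(@"[^0-9A-Za-z._\s]", input);

        // The punctuation class matches only ',' and '!'.
        string punctuation = Collect(@"[\p{P}-[._]]", input);

        Console.WriteLine(negated.Length);      // 3
        Console.WriteLine(punctuation.Length);  // 2
    }
}
```

The lengths are printed instead of the characters themselves so the result does not depend on the console encoding.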

edited Oct 19 '10 at 23:41

Abe Miessler

answered Oct 19 '10 at 23:27

Greg Hewgill

Okay, add space to the exclusion list. – Greg Hewgill Oct 19 '10 at 23:39
Would work on a limited set, but a lot of printable characters (currency symbols, mathematical symbols, diacritics etc.) are going to match this. – Wrikken Oct 20 '10 at 00:02
How about `º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ` etc. (you get the idea)? – Lucero Oct 20 '10 at 00:02
Can anyone explain why the full stop does not need escaping in this? Why isn't the full stop excluding every character? It doesn't - this works as described - I just don't understand the logic. This also seems to work as described if you do escape the full stop. I.e., `Regex("[^a-zA-Z0-9\\.]").Replace("a_b:c-d.e 4\\5&6%c£7.","_")` returns `"a_b_c_d.e_4_5_6_c_7."`, as does `Regex("[^a-zA-Z0-9.]")`. Better still, does anyone have a decent RTFM link? – Chris Apr 13 '15 at 14:29
@Chris: The full stop does not need escaping there because full stop has no special meaning when inside `[]` brackets. For convenience, most regex parsers will allow you to escape it there anyway with no change in meaning. – Greg Hewgill Apr 13 '15 at 18:07
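
The point about the full stop inside brackets can be checked with a short sketch like the one below. The class name DotInClassDemo, the helper Sanitize, and the shortened sample input are hypothetical; the two patterns are the ones from the comment above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotInClassDemo
{
    // Hypothetical helper: replaces every character matched by `pattern` with '_'.
    public static string Sanitize(string input, string pattern) =>
        Regex.Replace(input, pattern, "_");

    public static void Main()
    {
        // Outside brackets, '.' matches any single character except a newline.
        Console.WriteLine(Regex.IsMatch("a", @"^.$"));    // True

        // Inside brackets, '.' is a literal full stop.
        Console.WriteLine(Regex.IsMatch("a", @"^[.]$"));  // False

        // Escaped or not, the full stop inside the class behaves the same way.
        Console.WriteLine(Sanitize("a_b:c-d.e 4&5", @"[^a-zA-Z0-9.]"));   // a_b_c_d.e_4_5
        Console.WriteLine(Sanitize("a_b:c-d.e 4&5", @"[^a-zA-Z0-9\.]"));  // a_b_c_d.e_4_5
    }
}
```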
