106

How can I rewrite the pattern [a-zA-Z0-9!$* \t\r\n] so that it also matches a hyphen, along with the existing characters?

Thomas Anderson

6 Answers

230

The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning. You can escape the hyphen inside a character class, but you don’t need to if you place it first or last.

Thus:

  • - matches a hyphen.
  • [-] matches a hyphen.
  • [abc-] matches a, b, c or a hyphen.
  • [-abc] matches a, b, c or a hyphen.
  • [ab\-c] matches a, b, c or a hyphen.
  • [ab-d] matches a, b, c or d (only here does the hyphen denote a character range).
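
The cases listed above can be checked mechanically. This is a minimal sketch using Python's `re` module rather than .NET (an assumption made for illustration; to my knowledge these particular character-class rules are the same in both engines). The helper name `matched_chars` is hypothetical.

```python
import re

def matched_chars(pattern: str, candidates: str = "abcd-") -> str:
    # Return the candidate characters that the pattern matches in full.
    return "".join(ch for ch in candidates if re.fullmatch(pattern, ch))

# Outside a character class the hyphen is an ordinary character.
print(matched_chars(r"-"))        # -
# First, last, or escaped inside a class: still a literal hyphen.
print(matched_chars(r"[-]"))      # -
print(matched_chars(r"[abc-]"))   # abc-
print(matched_chars(r"[-abc]"))   # abc-
print(matched_chars(r"[ab\-c]"))  # abc-
# Between two characters it denotes a range, so the hyphen itself no longer matches.
print(matched_chars(r"[ab-d]"))   # abcd
```
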
Konrad Rudolph
  • [Hyphen is taken literally from inside a character class, if it cannot form a range](http://stackoverflow.com/questions/29458636/how-does-this-pattern-match-hyphen-without-escape). – MAKZ Apr 06 '15 at 08:33
  • Note: If you use the hex code for hyphen \x2D it will still see it as denoting a character range. (only tested in JavaScript) has anyone else found this? – MarkP Nov 21 '15 at 17:38
  • 2
    @MarkP Well, duh: character hex codes are converted *by the front-end parser* (of C#, or JavaScript, or whatever language you’re using) into the actual character. So using hex codes is the same as using the actual characters as far as the value of the string is concerned. – Konrad Rudolph Nov 21 '15 at 18:21
  • @Puck No, there's no need for the brackets. However, as the answer states, the dash **must be last**. In particular, the parentheses do not do what you're expecting; you need to remove them (and the space as well!). – Konrad Rudolph Apr 26 '16 at 11:36
  • 2
    @Pshemo Of course, stupid mistake. Regarding the interpretation in `[a-c-e]`: this is simply invalid in some regex specifications/engines. POSIX regex for instance disallows it. – Konrad Rudolph Nov 02 '16 at 17:32
  • `[%--]` matches any character between `%` and `-` (inclusive). `[--@]` matches any character between `-` and `@` (inclusive). The notation `[%--@]` is invalid as it is ambiguous. So a hyphen only is a hyphen if it is first or last in the group. So this is valid `[--@%--]` but this is not `[%----@]` – kvantour Oct 20 '21 at 09:12
78

Escape the hyphen.

[a-zA-Z0-9!$* \t\r\n\-]

UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
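
For the pattern in the question, the escaped and unescaped spellings should behave identically. A small sketch in Python's `re`, used here as an assumed stand-in for the .NET engine; the sample strings are made up:

```python
import re

# The question's class with a hyphen added: unescaped at the end, and escaped.
UNESCAPED = re.compile(r"[a-zA-Z0-9!$* \t\r\n-]+")
ESCAPED = re.compile(r"[a-zA-Z0-9!$* \t\r\n\-]+")

sample = "well-known $value!"  # hypothetical input

# Both patterns accept the whole string, hyphen included.
print(bool(UNESCAPED.fullmatch(sample)))  # True
print(bool(ESCAPED.fullmatch(sample)))    # True
# A character outside the class, such as '#', is still rejected.
print(bool(UNESCAPED.fullmatch("a#b")))   # False
```
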

Community
Neil Barnwell
  • Oh is it? Is that because it's in a character group? My bad. – Neil Barnwell Nov 01 '10 at 12:11
  • 11
    @KonradRudolph You are correct, but I am not sure the unescaped version is easier to understand. The two possible usages of dash are confusing, this is why there are questions about this to begin with. It is certainly more elegant once you know about it, but for beginners it is a bit confusing. – Christophe Roussy Jul 15 '14 at 13:02
  • `Escape the hyphen.` I think this answer is misleading and should be deleted. As @KonradRudolph said: **make it first or last char in the character class; otherwise it has no special meaning**. For this staying here will keep the misinformation alive for careless or fast moving engineers – Ahmet Jan 24 '23 at 14:48
15

It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.

But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.

This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.

All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.

I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
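
To make those categories concrete: Python's built-in `re` module does not support `\p{...}`, so this sketch swaps in `unicodedata.category` to emulate the class `[\p{L}\p{Nd}\p{Pd}!$*]` one character at a time. It is not the .NET API, only an illustration of which characters those properties cover; `in_class` is a hypothetical helper.

```python
import unicodedata

def in_class(ch: str) -> bool:
    # Emulates the class above for one character: any letter (L*),
    # decimal digit (Nd), dash punctuation (Pd), or one of ! $ *
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat in ("Nd", "Pd") or ch in "!$*"

print(in_class("-"))       # True: HYPHEN-MINUS is in Pd
print(in_class("\u2013"))  # True: EN DASH is in Pd
print(in_class("\u00e9"))  # True: e with acute accent is a letter (Ll)
print(in_class("\u0663"))  # True: ARABIC-INDIC DIGIT THREE is in Nd
print(in_class("_"))       # False: underscore is Pc (connector punctuation)
print(in_class(" "))       # False: whitespace is not part of this class
```

Note that the class as written does not include whitespace; adding \s would restore the spaces, tabs and newlines from the original pattern.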

tchrist
  • I tend to agree with this answer: the less you need to know, the safer the code. This reminds me of problems with operator priorities: http://stackoverflow.com/questions/10007140/operator-precedence-and-ternary-operator. I prefer having parentheses there (automatically added by my IDE), so there is no need to know them all. You or someone else may mess up sooner or later. Of course, if you work a lot with regex in your projects, you may need more advanced knowledge. – Christophe Roussy Jul 15 '14 at 12:54
7

[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all equivalent. A hyphen between two ranges is treated as a literal character. The regex [a-z0-9-+()]+ also allows a hyphen.
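
A quick check of the patterns above, sketched with Python's `re` (an assumption for illustration; some engines reject a hyphen placed directly after a range, so treat the result as engine-specific):

```python
import re

equivalent = [r"[-a-z0-9]+", r"[a-z0-9-]+", r"[a-z-0-9]+"]

for pattern in equivalent:
    # Each class accepts lowercase letters, digits and a literal hyphen...
    print(pattern, bool(re.fullmatch(pattern, "abc-123")))  # True
    # ...and nothing else, so a '+' is rejected.
    print(pattern, bool(re.fullmatch(pattern, "a+b")))      # False

# The last pattern additionally lists '+', '(' and ')'.
extended = r"[a-z0-9-+()]+"
print(bool(re.fullmatch(extended, "a-(b+1)")))  # True
```
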

Parimala
5

Use \p{Pd} to match any kind of dash, not just '-'. The '-' character (HYPHEN-MINUS) is only one member of that Unicode category, and it also happens to have a special meaning inside a regex character class.

Radu Simionescu
3

Is this what you are after?

// Requires: using System.Text.RegularExpressions;
MatchCollection matches = Regex.Matches(mystring, "-"); // every literal hyphen in mystring
Aliostad