14

How can I match all the “special” chars (like +_*&^%$#@!~) except the char - in PHP?

I know that \W will match all the “special” chars including the -.

Any suggestions in consideration of Unicode letters?

tchrist
CaTz

3 Answers

52
  • [^-] matches any character except -
  • [\W] matches all "special" (non-word) characters, as you know
  • [^\w] matches the same set of characters, written as a negated class. Sounds fair?

So [^\w-] is the combination of both: all "special" characters, but without -.
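
A minimal sketch of how that class behaves in PHP. The sample string and variable names are made up for illustration, and the pattern is run without any modifiers, so \w covers only ASCII word characters here:

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$input = 'a-b_c+d*e';

// [^\w-] matches any character that is neither a word character nor "-".
// \w includes the underscore, so "_" is not matched either.
preg_match_all('/[^\w-]/', $input, $matches);

$special = $matches[0];
print_r($special);
```

Only `+` and `*` end up in `$special`; the underscore is skipped because it counts as a word character.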

hakre
  • Works as well, thank you. For some reason it doesn't match the _ char, but I managed to work around that. – CaTz Mar 15 '12 at 20:14
  • What is it that you think that `[\W]` does that `\W` does not? – tchrist Mar 15 '12 at 23:50
  • @tchrist: What do you mean, I don't think that. – hakre Mar 16 '12 at 00:16
  • Why would you write brackets around a single character class abbreviation? – tchrist Mar 16 '12 at 00:22
  • 1
    You seem to have misclassified things like `_` as ***non-special***, things like `àéüîøçñ` as ***half-special***, and things like `‾ΑΒK5` as ***special***. That makes no sense at all. – tchrist Mar 16 '12 at 00:31
  • @tchrist: Take it easy. The list is a step-by-step build-up (I don't know the exact word in English), so that you can combine the patterns of each item into the result, which makes it easy to understand. It's not written to use the absolute minimum number of characters for an equivalent pattern, but with the intent of giving a guideline for finding the character class the OP was looking for. I don't understand your other sentence, and I wish you would limit the charsets you use to the C locale to keep things light. – hakre Mar 16 '12 at 00:44
  • Your answer doesn’t work correctly for Unicode, per the OP’s requirement. – tchrist Mar 16 '12 at 00:57
  • @tchrist: I didn't address your Unicode concerns specifically back in 2012. PCRE supports Unicode for the \w character class through options. From the docs: *"Matching characters by Unicode property is not fast, because PCRE has to do a multistage table lookup in order to find a character's property. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE by default, though you can make them do so by setting the PCRE_UCP option or by starting the pattern with (*UCP)."* - https://www.pcre.org/original/doc/html/pcrepattern.html (maybe not back then) – hakre Jan 05 '20 at 22:11
6
  • \pL matches any character with the Unicode Letter character property, which is a major general category group; that is, it matches [\p{Ll}\p{Lt}\p{Lu}\p{Lm}\p{Lo}].
  • \pN matches any character with the Unicode Number character property, which is a major general category group; that is, it matches [\p{Nd}\p{Nl}\p{No}].
  • Note that the Unicode Alphabetic character property also includes certain combining marks such as U+0345 ◌ͅ ᴄᴏᴍʙɪɴɪɴɢ ɢʀᴇᴇᴋ ʏᴘᴏɢᴇɢʀᴀᴍᴍᴇɴɪ. I suggest that you also include \pM, which matches any character with the Unicode Mark character property, which is a major general category group; that is, it matches [\p{Mn}\p{Me}\p{Mc}].
  • Character U+002D ʜʏᴘʜᴇɴ-ᴍɪɴᴜꜱ is probably the - you’re referring to.
  • Note though that Unicode v6.1 has 27 characters with the Unicode Dash character property, including such common characters as U+2010 ʜʏᴘʜᴇɴ, U+2013 ᴇɴ ᴅᴀꜱʜ, U+2014 ᴇᴍ ᴅᴀꜱʜ, and U+2212 ᴍɪɴᴜꜱ ꜱɪɢɴ. Whether you actually want to include or exclude those, I have no idea.

Given all that, it is not unlikely that you want something like:

[^\pL\pN\pM\x2D\x{2010}-\x{2015}\x{2212}]
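
A minimal sketch of using this class from PHP, assuming PHP 7.0 or later (for the `\u{...}` string escapes) and the `u` modifier so that the subject is treated as UTF-8 and `\x{...}` can name code points above 255. The sample string and variable names are invented for illustration:

```php
<?php
// Hypothetical sample input: "naïve", a hyphen-minus, a digit,
// an en dash (U+2013), and a few ASCII symbols.
$input = "na\u{00EF}ve-5\u{2013}ok!+_*";

// Single quotes keep the backslashes intact for the regex engine.
// The /u modifier makes PCRE read both pattern and subject as UTF-8.
$pattern = '/[^\pL\pN\pM\x2D\x{2010}-\x{2015}\x{2212}]/u';

preg_match_all($pattern, $input, $matches);

$special = $matches[0];
print_r($special);
```

Here `$special` holds `!`, `+`, `_`, and `*`. The letters, the digit, the hyphen-minus, and the en dash are all excluded. Note that this class also matches whitespace, since spaces are not letters, numbers, marks, or dashes.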
tchrist
4

You can try this pattern

([^a-zA-Z-])

This should match any single character that is not an ASCII letter (a-z or A-Z) and not the -
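
A small sketch of what this class does with non-ASCII letters and digits. The `u` modifier and the sample string are assumptions added for illustration (they are not part of the pattern above), and the `\u{...}` escape needs PHP 7.0 or later:

```php
<?php
// Hypothetical sample input: "é" (U+00E9), "+", "-", "a", "1".
$input = "\u{00E9}+-a1";

// Same character class as above, with /u added so that "é" is read
// as one character rather than as two separate bytes.
preg_match_all('/[^a-zA-Z-]/u', $input, $matches);

$special = $matches[0];
print_r($special);
```

The result contains `é`, `+`, and `1`: the accented letter and the digit are treated as "special" because the class only knows about ASCII letters.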

Austin Brunkhorst
  • It's not good, because there can be Unicode letters... Anyway, found the answer! [^\p{L}-\d] – CaTz Mar 15 '12 at 20:07
  • Considering that you were very broad with your question, there was no specific scope of characters set, so this is my assumption. – Austin Brunkhorst Mar 15 '12 at 20:09