Why does `\p{P}` in `(^|\p{P})(?!,alpha,),*alpha,*` behave differently from `\p{Ps}` in `(^|\p{Ps})(?!,alpha,),*alpha,*` when processing the input `(,alpha,`?

The `\p{P}` version matches, whereas the `\p{Ps}` version does not.

- The problem is with `\p{Po}`. For example, `(^|[\p{P}-[\p{Po}]])(?!,alpha,),*alpha,*` works as wanted. But I would still be interested in knowing why. – asr Oct 18 '20 at 11:01
- See the [list of chars matched with `\p{Ps}`](https://www.fileformat.info/info/unicode/category/Ps/list.htm). List of all categories: https://www.fileformat.info/info/unicode/category/index.htm – Wiktor Stribiżew Oct 18 '20 at 11:11
1 Answer
Because `\p{P}` matches the `,` in your string, not the `(`.

This is because you have a negative lookahead `(?!,alpha,)` after `\p{P}`. This means that after "any punctuation", there must not be the string `,alpha,`. There is `,alpha,` after `(`, so the attempt in which `\p{P}` matches `(` fails. The regex engine moves forward one character and tries again. This time, `\p{P}` matches `,`, and there is no `,alpha,` after that `,` (there is only `alpha,`), so the lookahead passes. The rest of the pattern matches too, so the whole match succeeds. The matched string is `,alpha,`, without the `(`.
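The trace above can be checked with a minimal sketch. It uses Java's `java.util.regex` as a stand-in engine, which is an assumption on my part: the question does not name its engine, but Java treats `\p{P}` and the lookahead the same way for this pattern. The class name `PunctuationTrace` and the helper `describe` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PunctuationTrace {
    // The pattern from the question, with \p{P} (any punctuation).
    static final Pattern ANY_PUNCT =
            Pattern.compile("(^|\\p{P})(?!,alpha,),*alpha,*");

    // Hypothetical helper: returns "start:group1:match" for the first
    // match in the input, or "no match" if the pattern does not match.
    static String describe(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.start() + ":" + m.group(1) + ":" + m.group();
    }

    public static void main(String[] args) {
        // The match starts at index 1, group 1 is the comma,
        // and the "(" at index 0 is not part of the match.
        System.out.println(describe(ANY_PUNCT, "(,alpha,")); // 1:,:,alpha,
    }
}
```

The start index of 1 is what shows that the `(` was skipped: the first group captured the comma, not the parenthesis.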
If you change `\p{P}` to `\p{Ps}`, the attempt at `(` fails just like before, but `\p{Ps}` also cannot match `,` (a comma is in category `Po`, not `Ps`), which causes the whole match to fail. Note that the `^` alternative doesn't help either: even though the lookahead passes at the start of the string, the rest of the regex, `,*alpha`, has to match immediately afterwards, and the string starts with `(` instead.
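As a rough check of the `\p{Ps}` case, here is a small sketch, again assuming Java's `java.util.regex` as a stand-in for whatever engine the question uses. The class and method names are hypothetical. `Character.getType` reports the Unicode general category, which shows why the comma is not matched by `\p{Ps}`.

```java
import java.util.regex.Pattern;

class OpenPunctuationCheck {
    // The pattern from the question, with \p{Ps} (open punctuation only).
    static final Pattern OPEN_PUNCT =
            Pattern.compile("(^|\\p{Ps})(?!,alpha,),*alpha,*");

    // True if the pattern matches anywhere in the input.
    static boolean matchesAnywhere(String input) {
        return OPEN_PUNCT.matcher(input).find();
    }

    // True if the character is in Unicode category Ps.
    static boolean isOpenPunctuation(char c) {
        return Character.getType(c) == Character.START_PUNCTUATION;
    }

    public static void main(String[] args) {
        System.out.println(isOpenPunctuation('('));      // true
        System.out.println(isOpenPunctuation(','));      // false
        System.out.println(matchesAnywhere("(,alpha,")); // false
        // The ^ alternative does work when "alpha" starts the string.
        System.out.println(matchesAnywhere("alpha,"));   // true
    }
}
```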
