Even though I stumbled over this because a PHP regex I wrote failed to match a quotation mark in the way I expected, I'm not sure if this is the right place to ask. After all, the definition in PHP (and probably other Unicode-aware regex engines) seems to match the official categorization (see e.g. https://www.fileformat.info/info/unicode/char/201e/index.htm) and it is this official categorization I am unhappy with.
According to this, the DOUBLE LOW-9 QUOTATION MARK is categorized as Ps (and therefore matched by /\p{Ps}/) and, in spite of its very name, not as Pi (initial quotation mark), which is how it is used in German. It didn't even make it into the less specific 'Punctuation, Initial quote (may behave like Ps or Pe depending on usage)' category. What could be the reason for this (mis)categorization? In what languages is it actually used as a Ps (i.e., similar to "(" or "[" or "{")?
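To make the observation reproducible outside of PHP, here is a minimal sketch that looks up the general category directly. It uses Python's standard unicodedata module rather than a regex engine, on the assumption that any engine following the Unicode Character Database reports the same values. The selection of sample characters and the helper name general_categories are my own, chosen only for illustration.

```python
import unicodedata

# Sample characters (an arbitrary selection for illustration):
# low-9 marks, curly double quotes, guillemets, "(" and the ASCII quote.
SAMPLES = ["\u201e", "\u201a", "\u201c", "\u201d", "\u00ab", "\u00bb", "(", '"']


def general_categories(chars):
    """Map each character to its Unicode general category (e.g. 'Ps', 'Pi')."""
    return {ch: unicodedata.category(ch) for ch in chars}


categories = general_categories(SAMPLES)

for ch, cat in categories.items():
    # Print code point, official character name, and general category.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):45} {cat}")
```

In this listing the low-9 marks land in the same category as "(", while the curly quotes and guillemets are reported as Pi or Pf and the ASCII quote as Po, which is exactly the split that makes a single property class insufficient.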
But most importantly: What is a suitable regex that covers all kinds of quotation marks across all languages without enumerating too many individual codepoints?