Even though I stumbled over this because a PHP regex I wrote failed to match it in the way I expected, I'm not sure if this is the right place to ask. After all, the definition in PHP (and probably other Unicode-aware regex engines) seems to match the official categorization (cf. e.g. https://www.fileformat.info/info/unicode/char/201e/index.htm) and it is this official categorization I am unhappy with.

According to this, the DOUBLE LOW-9 QUOTATION MARK (U+201E, „) is categorized as Ps (and therefore matched by /\p{Ps}/) and, in spite of its very name, not as Pi (initial quotation mark), which is how it is used in German. It didn't even make it into the less specific 'Punctuation, Initial quote (may behave like Ps or Pe depending on usage)' category. What could be the reason for this (mis)categorization? In what languages is it actually used as a Ps (i.e., similar to "(" or "[" or "{")?

But most importantly: What is a suitable regex that covers all kinds of quotation marks across all languages without enumerating too many individual codepoints?

Remy Lebeau
Hagen von Eitzen
  • Maybe it would work if you had a character class set up like `[\p{Pi}\x{201E}]`, for example? Not sure if there would be other quotation marks missing from `Pi`. – JvdV Jun 04 '20 at 13:05
  • Also, a list of any such quotation marks used as brackets could be found [here](https://stackoverflow.com/a/13535289/9758194) if you want to dig through it. A quick scan tells me that there are several that fall under `Ps`, `Pe` and `Pf` – JvdV Jun 04 '20 at 13:16
  • My guess would be that it's used as initial punctuation ***in German***, but not necessarily everywhere else. Perhaps its use is too ambiguous to be generally classifiable as one or the other. – deceze Jun 04 '20 at 13:30
  • @deceze Dutch used the „ too in the past, although we seem to have switched to “ later. – Mr Lister Jun 04 '20 at 20:24
  • `/[\x{FE41}-\x{FE44}\x{FF02}\x{FF07}\x{FF62}-\x{FF63}\x{2018}-\x{201F}\x{2039}-\x{203A}\x{300C}-\x{300F}\x{301D}-\x{301F}\x{2E42}\x{22}\x{27}\x{AB}\x{BB}]/u` covers all codepoints with `Quotation_Mark` = Y – nj_ Jun 06 '20 at 11:00
  • @nj_ Seems legit but ignores the "without enumerating too many individual code points" part. I might still use it in my project as a defined constant, thanks – Hagen von Eitzen Jun 06 '20 at 15:58

1 Answer

The general categories Pi (Initial_Punctuation) and Pf (Final_Punctuation) are not used exclusively for quotation marks, just like Ps (Open_Punctuation) and Pe (Close_Punctuation) are not used exclusively for characters that aren’t quotation marks. Rather, Pi and Pf are used for pairs of characters where either one can be opening or closing depending on usage, whereas Ps characters are always opening and Pe characters are always closing (ignoring rare or specialised cases). Which of these general categories a character belongs to is based on these considerations and has nothing to do with whether it is a quotation mark, a bracket or something else.

U+201E DOUBLE LOW-9 QUOTATION MARK is categorised as Ps because there is no established orthography in the world where it can be used as a closing mark. It is always opening in practice. In contrast, U+201C LEFT DOUBLE QUOTATION MARK is categorised as Pi because it can be both an opening and a closing quote depending on which specific style of quotes you choose: it opens a quotation in English (“…”) but closes one in German („…“).
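To see these general categories for yourself, here is a minimal sketch in Python (not PHP, purely because its standard `unicodedata` module exposes the category directly). The helper name `general_category` and the sample list are my own illustration, not part of any library.

```python
import unicodedata


def general_category(ch: str) -> str:
    """Return the two-letter Unicode general category of a single character."""
    return unicodedata.category(ch)


# Hypothetical sample: a few quotation marks plus one bracket for comparison.
SAMPLES = ["\u201E", "\u201C", "\u201D", "\u00AB", "\u00BB", "("]

for ch in SAMPLES:
    # Prints the code point, its official name, and its general category.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {general_category(ch)}")
```

U+201E should come out as Ps and U+201C as Pi, which is exactly the split described above.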

Unicode has a dedicated property for identifying quotation marks appropriately named Quotation_Mark. This property is defined independently from the general category values previously discussed.
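If your regex engine has no shorthand for that property, one workaround is to spell the set out once as a named constant. The sketch below does this in Python, whose built-in `re` module has no `\p{...}` support at all; the code points are the ones listed in nj_'s comment under the question, and the names `QUOTATION_MARK` and `find_quotes` are hypothetical. Treat the list as an assumption to re-check against the Unicode version you target.

```python
import re

# Character class built from the code points in the comment under the question
# (those with Quotation_Mark = Y); verify against your Unicode version.
QUOTATION_MARK = re.compile(
    "["
    "\u0022\u0027\u00AB\u00BB"   # ASCII quotes and guillemets
    "\u2018-\u201F"              # curly and low-9 quotation marks
    "\u2039-\u203A"              # single guillemets
    "\u2E42"                     # double low-reversed-9 quotation mark
    "\u300C-\u300F"              # CJK corner brackets
    "\u301D-\u301F"              # CJK double prime quotation marks
    "\uFE41-\uFE44"              # vertical presentation forms
    "\uFF02\uFF07"               # fullwidth ASCII quotes
    "\uFF62-\uFF63"              # halfwidth corner brackets
    "]"
)


def find_quotes(text: str) -> list[str]:
    """Return every quotation-mark character in text, in order of appearance."""
    return QUOTATION_MARK.findall(text)


print(find_quotes("\u201EHallo\u201C and \u00ABsalut\u00BB"))
```

This still enumerates code points, but only in one place, so the rest of the code can refer to the constant by name.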

CharlotteBuff
  • Sounds like a weird rationale. "This is not always at the *end*, it is sometimes at the *start*, so we better call it *final*"?? /rant -- But more importantly for applications: It seems that the `Quotation_mark` property is not available as a shorthand in common regex engines? – Hagen von Eitzen Jun 06 '20 at 15:56