Some opening punctuation characters (Unicode General Category Ps) and opening quote characters (Unicode General Category Pi) happen to have their appropriate closing character at the very next code point. For example, ( is U+0028 and ) is U+0029. Similarly, ⟪ is U+27EA and ⟫ is U+27EB. But there are exceptions, such as « (U+00AB), which has its matching character, », sixteen code points away at U+00BB.

Given an opening character, how can I determine the appropriate closing character?

(I've tagged this question python because I ultimately want to accomplish this in Python, but a language-neutral answer is fine, too.)

Edit: Thanks for pointing me to List of all unicode's open/close brackets?. In particular, this answer shows the pairs of brackets (i.e., Ps and Pe characters). But the question of finding a matching quote character (i.e., Pi and Pf characters) that doesn't happen to be a mirror image, like for , seems to be left open.

Kodiologist
  • Possibly a duplicate of https://stackoverflow.com/questions/13535172/list-of-all-unicodes-open-close-brackets/13535350 ? I started writing an answer, but realized it would mostly be saying the same things as the answers there. – hobbs Jul 18 '17 at 22:24
  • @hobbs - ditto :D Thanks for finding it out. – zwer Jul 18 '17 at 22:25
  • Possible duplicate of [List of all unicode's open/close brackets?](https://stackoverflow.com/questions/13535172/list-of-all-unicodes-open-close-brackets) – zwer Jul 18 '17 at 22:25
  • @hobbs Please see my edit. – Kodiologist Jul 19 '17 at 05:13
  • The problem with `Pi` etc. is that their use is ambiguous. For example, in English you usually use double quotes “like this”, whereas in German it's quite common to use them „like this“, so a `Pi` character can be sometimes opening, sometimes closing here. – lenz Jul 19 '17 at 11:09
  • @lenz We can still say that whenever `“` is an opening character, then `”` is the corresponding closing character, can't we? That's all I need here. – Kodiologist Jul 19 '17 at 14:17
  • Sure. Depending on the approach, you could even add both pairs and mark them as alternatives. Combine it with a heuristic that picks the one with fewer "left-overs". – lenz Jul 19 '17 at 14:33
  • Related: [quotation marks: Analysis and explanation](https://stackoverflow.com/a/41496756/3439404). – JosefZ Jul 19 '17 at 16:05

1 Answer

As I mentioned in the edit to the question, the Unicode data file BidiBrackets.txt shows all the matching bracket characters, where the opening character is Ps. As for quote characters Pi, there aren't too many of these, so I just found what looked like the most obvious closing character by hand:

« »
‘ ’
‛ ’
“ ”
‹ ›
⸂ ⸃
⸄ ⸅
⸉ ⸊
⸌ ⸍
⸜ ⸝
⸠ ⸡
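
Here is a minimal Python sketch of how the two sources could be combined into one lookup. The names `QUOTE_PAIRS`, `parse_bidi_brackets`, and `closing_for` are made up for illustration, the quote table is just the hand-picked list above (not an official Unicode mapping), and the sample text is only a short excerpt that assumes the `open; close; type # name` line format of BidiBrackets.txt:

```python
# Hand-picked pairs from the list above (opening Pi character -> closing
# character). This is a manual choice, not data published by Unicode.
QUOTE_PAIRS = {
    "\u00ab": "\u00bb",  # « »
    "\u2018": "\u2019",  # ‘ ’
    "\u201b": "\u2019",  # ‛ ’
    "\u201c": "\u201d",  # “ ”
    "\u2039": "\u203a",  # ‹ ›
    "\u2e02": "\u2e03",  # ⸂ ⸃
    "\u2e04": "\u2e05",  # ⸄ ⸅
    "\u2e09": "\u2e0a",  # ⸉ ⸊
    "\u2e0c": "\u2e0d",  # ⸌ ⸍
    "\u2e1c": "\u2e1d",  # ⸜ ⸝
    "\u2e20": "\u2e21",  # ⸠ ⸡
}


def parse_bidi_brackets(text):
    """Return {opening: closing} from text in the BidiBrackets.txt format.

    Assumes lines like "0028; 0029; o # LEFT PARENTHESIS", where the third
    field is "o" for an opening bracket and "c" for a closing one.
    """
    pairs = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) == 3 and fields[2] == "o":
            pairs[chr(int(fields[0], 16))] = chr(int(fields[1], 16))
    return pairs


def closing_for(char, bracket_pairs):
    """Look up the closing character for char, or None if unknown."""
    if char in bracket_pairs:
        return bracket_pairs[char]
    return QUOTE_PAIRS.get(char)


# Short excerpt for demonstration; in practice, read the real data file.
SAMPLE = """\
0028; 0029; o # LEFT PARENTHESIS
0029; 0028; c # RIGHT PARENTHESIS
27EA; 27EB; o # MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
27EB; 27EA; c # MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET
"""

BRACKETS = parse_bidi_brackets(SAMPLE)
```

With this, `closing_for("«", BRACKETS)` gives `»` and `closing_for("(", BRACKETS)` gives `)`; anything not in either table returns `None`.
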
Kodiologist