I'm looking for the maximum number of Unicode combining characters that can follow a non-combining character in realistic natural-language text.

I know that Unicode text can contain an arbitrary number of combining characters placed anywhere. However, I am writing a specialized application that has to operate under constrained resources, and for that and other technical reasons, displaying an arbitrary number of combining characters after a non-combining one is not an option. I would still like to display natural languages properly where possible, and supporting a small number of combining characters should not be a problem.

My intuition is that natural languages don't need more than two or three combining characters after a base character, but I'm not sure and can't find any source for that number.

kralyk
  • Greek needs three; for example alpha with iota subscript, circumflex accent and soft breathing: `ᾆ`. I don't know of any writing system which needs more than three diacritics on the same letter. – AlexP May 10 '18 at 12:25
  • @AlexP Thanks for the mention. Looks like after normalization that character can be just one Unicode character (`GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI`) unless I'm wrong... – kralyk May 10 '18 at 13:20
  • That depends on what normalization you use (NFC or NFD), and what normalization you use depends on the purpose of normalization. For example, for purposes of searching strings in the text you would normally normalize to NFD then discard all combining diacritics. And anyway, there are no (Ancient) Greek keyboards so the character must needs be entered (at least partially) decomposed. – AlexP May 10 '18 at 13:22
  • Apparently it's possible to have 8 combining characters in Tibetan: https://stackoverflow.com/a/11983435/1607043 – DPenner1 May 10 '18 at 21:34
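
To make the normalization point from the comments concrete, here is a minimal Python sketch that counts the longest run of combining marks under a given normalization form, and that strips marks after NFD as described for searching. The function names `max_combining_run` and `strip_marks` are made up for illustration, and treating every code point with general category `M*` as a combining mark is an assumption; a stricter definition may be appropriate for a given application.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    # Assumption: "combining character" means general category Mn, Mc or Me.
    return unicodedata.category(ch).startswith("M")


def max_combining_run(text: str, form: str = "NFD") -> int:
    """Longest run of consecutive combining marks after normalizing to `form`."""
    longest = run = 0
    for ch in unicodedata.normalize(form, text):
        if is_mark(ch):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def strip_marks(text: str) -> str:
    """Normalize to NFD, then discard all combining marks (search folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not is_mark(ch))


# U+1F86: GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
alpha = "\u1f86"
print(max_combining_run(alpha, "NFD"))  # marks present when fully decomposed
print(max_combining_run(alpha, "NFC"))  # marks present when fully composed
print(strip_marks(alpha) == "\u03b1")   # only the bare alpha remains
```

Running the same function over a representative corpus in NFD would give an empirical upper bound for the languages that corpus covers, which is not the same as a guarantee for all natural text.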

1 Answer

OK, for lack of a better answer, here's what I did (for future reference):

I ended up using a SmallVec-like container that holds up to 8 bytes inline before allocating, with an upper limit of about 50 bytes per character plus its combining characters (text is stored in UTF-8). I think that should keep everyone happy, and performance doesn't suffer.

Take those numbers with a pinch of salt; they are arbitrary and I might tune them anyway.
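
For anyone wanting to experiment with such limits, here is a rough Python sketch of the byte-budget idea only; it does not model inline versus heap storage the way a SmallVec-like container would. The names `clusters`, `clamp_cluster` and `fits_inline` are hypothetical, the two constants are the arbitrary numbers from above, and grouping "one base character plus the marks that follow it" is a simplification rather than full grapheme cluster segmentation.

```python
import unicodedata

INLINE_BYTES = 8  # arbitrary: clusters up to this size would need no allocation
MAX_BYTES = 50    # arbitrary: combining marks beyond this budget are dropped


def is_mark(ch: str) -> bool:
    # Assumption: general category Mn, Mc or Me counts as a combining mark.
    return unicodedata.category(ch).startswith("M")


def clusters(text: str):
    """Yield each base character together with the marks that follow it."""
    current = ""
    for ch in text:
        if current and is_mark(ch):
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current


def clamp_cluster(cluster: str, limit: int = MAX_BYTES) -> str:
    """Keep the base character, then as many marks as fit in `limit` UTF-8 bytes."""
    out = cluster[0]  # the first code point is always kept (at most 4 bytes)
    size = len(out.encode("utf-8"))
    for ch in cluster[1:]:
        n = len(ch.encode("utf-8"))
        if size + n > limit:
            break  # stop at the first mark that would exceed the budget
        out += ch
        size += n
    return out


def fits_inline(cluster: str) -> bool:
    """True if the cluster's UTF-8 encoding fits the inline threshold."""
    return len(cluster.encode("utf-8")) <= INLINE_BYTES


# Fully decomposed Greek alpha with three marks: four 2-byte code points.
greek = "\u03b1\u0313\u0342\u0345"
print(fits_inline(greek))

# A pathological cluster: one base letter followed by 100 combining acutes.
stacked = "a" + "\u0301" * 100
print(len(clamp_cluster(stacked).encode("utf-8")))
```

With these numbers the decomposed Greek example from the comments just fits the inline threshold, while longer stacks such as the Tibetan case would spill past it but may still stay under the upper limit.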

kralyk