Possible combining character sequences in Unicode

Question

There are some characters which are not included in Unicode (e.g. accented Cyrillic letters) but can be created using combining sequences. As I understand it, the possible combining character sequences are defined by the layout engine and/or the font used. Am I right? If so, how can I get all the possible combining sequences?

Accented Cyrillic letters **are** included in Unicode, just not as predefined composite characters. — Sebastian Negraszus, Jan 21 '13 at 13:36
What would you do with such a list of all possible combining sequences? Also: it would likely be *very, very* big (not endless unless you start applying the same combining character multiple times). — Joachim Sauer, Jan 21 '13 at 14:53
@JoachimSauer: Allowing every available combining character (currently, that's 1645!) to be applied to a single base character, but not allowing any of them to be applied twice, would be a strange restriction, though :) — Sebastian Negraszus, Jan 21 '13 at 15:02
@Sebastian Negraszus, by saying that accented Cyrillic letters are not included in Unicode, I mean that they are missing from the character repertoire (Universal Character Set); that is, there are no code points corresponding to them. — andrew, Jan 21 '13 at 16:02

score 5 · Accepted Answer · edited Jun 20 '20 at 09:12

You are correct that an arbitrary combining sequence may fail to render properly with a given combination of layout engine and font. A solution to this problem is outside the remit of the Unicode standard.

From Unicode 6.2, chapter 2:

All combining characters can be applied to any base character and can, in principle, be used with any script. As with other characters, the allocation of a combining character to one block or another identifies only its primary usage; it is not intended to define or limit the range of characters to which it may be applied. In the Unicode Standard, all sequences of character codes are permitted.

This does not create an obligation on implementations to support all possible combinations equally well. Thus, while application of an Arabic annotation mark to a Han character or a Devanagari consonant is permitted, it is unlikely to be supported well in rendering or to make much sense.
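To make the quoted principle concrete, here is a minimal Python sketch using the standard `unicodedata` module. It counts the combining marks known to the running interpreter's Unicode database (the number depends on the bundled Unicode version) and builds two sequences with arbitrary bases. The helper name `is_combining_mark`, the choice to treat general categories Mn, Mc, and Me as "combining", and the sample pairings are illustrative assumptions, not something taken from the standard's text above.

```python
import sys
import unicodedata


def is_combining_mark(ch: str) -> bool:
    """True if ch has General_Category Mn, Mc, or Me (an assumed definition)."""
    return unicodedata.category(ch).startswith("M")


# Count the marks in the Unicode version bundled with this Python build.
mark_count = sum(
    1 for cp in range(sys.maxunicode + 1) if is_combining_mark(chr(cp))
)
print(unicodedata.unidata_version, mark_count)

# Any mark may follow any base as far as the encoding is concerned.
ya_acute = "\u044f\u0301"    # CYRILLIC SMALL LETTER YA + COMBINING ACUTE ACCENT
han_arabic = "\u4e2d\u06d6"  # a Han ideograph + an Arabic mark (U+06D6)
```

Both strings are valid Unicode text; whether they display sensibly is a separate question for the font and the rendering engine.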

score 1 · Answer 2 · answered Jan 21 '13 at 14:43

It depends on your specific layout engine, whether and how you can query if a certain Unicode character sequence is displayable.

Sebastian Negraszus


score 1 · Answer 3 · answered Jan 22 '13 at 08:37

The set of possible combining character sequences in Unicode is literally infinite (though only countably infinite), because a combining character may appear after any character, including another combining character. Sometimes you see people play with this on Stack Overflow, using a character with a long string of combining characters after it.

So the list would be infinite. It can be generated automatically, but it would not be of much use.
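As a rough illustration of why the list can be generated but never finished, the following sketch enumerates sequences by increasing number of marks over a tiny sample of bases and marks. The function name `combining_sequences` and the two sample alphabets are hypothetical choices made for this example only.

```python
from itertools import count, islice, product
from typing import Iterator, Sequence


def combining_sequences(bases: Sequence[str], marks: Sequence[str]) -> Iterator[str]:
    """Yield base + n marks for n = 1, 2, 3, ...; the generator never terminates."""
    for n in count(1):
        for base in bases:
            for combo in product(marks, repeat=n):
                yield base + "".join(combo)


# Tiny illustrative alphabets; the real sets are far larger.
BASES = ["\u044f", "\u0430"]  # CYRILLIC SMALL LETTER YA, CYRILLIC SMALL LETTER A
MARKS = ["\u0301", "\u0300"]  # COMBINING ACUTE ACCENT, COMBINING GRAVE ACCENT

# Take only a finite prefix of the infinite enumeration.
first_six = list(islice(combining_sequences(BASES, MARKS), 6))
```

For each mark count `n` there are `len(bases) * len(marks) ** n` sequences, so even a finite cut-off grows exponentially with the number of marks allowed per base.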

Accented Cyrillic characters are included in Unicode, just not as precomposed characters. In Unicode, an accented Cyrillic character is simply two Unicode code points in succession.
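The difference between a precomposed character and a combining sequence can be observed with Unicode normalization. This is a small sketch using Python's standard `unicodedata` module; the variable names are arbitrary. NFC composes a sequence into a single code point only when a precomposed character exists for it.

```python
import unicodedata

ya_acute = "\u044f\u0301"  # CYRILLIC SMALL LETTER YA + COMBINING ACUTE ACCENT
i_breve = "\u0438\u0306"   # CYRILLIC SMALL LETTER I + COMBINING BREVE

nfc_ya = unicodedata.normalize("NFC", ya_acute)
nfc_i = unicodedata.normalize("NFC", i_breve)

print(len(nfc_ya))  # 2: no precomposed form exists, so the sequence is kept
print(len(nfc_i))   # 1: composed to U+0439 CYRILLIC SMALL LETTER SHORT I
```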

The quality of presentation depends on the font(s) used and on the rendering engine. As a rule, new software can handle simple cases like я́ (Cyrillic letter ya with acute) well, but old software may have simplistic rendering routines that misplace the diacritic at times. Quality rendering requires that the software accesses information about the dimensions of the base character and places the diacritic accordingly.

It is important that the diacritic is taken from the same font as the base character. “Cross-font” combinations tend to produce poor or awful results. So you should first check which fonts contain the combining acute U+0301, and then select the font among the remaining candidates.
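The font-selection step can be sketched as a simple coverage filter. Everything here is hypothetical: the function name `fonts_covering`, the font names, and the coverage sets are made up for illustration, and in practice the set of supported code points would have to be read from each font's character map with a font library or a platform API.

```python
from typing import Dict, List, Set


def fonts_covering(text: str, coverage: Dict[str, Set[int]]) -> List[str]:
    """Return names of fonts whose code point set includes every character of text."""
    needed = {ord(ch) for ch in text}
    return [name for name, cps in coverage.items() if needed <= cps]


# Hypothetical coverage data, not taken from any real font.
SAMPLE_COVERAGE = {
    "FontA": {0x044F, 0x0301},  # has both the base letter and U+0301
    "FontB": {0x044F},          # has the base letter only
}

candidates = fonts_covering("\u044f\u0301", SAMPLE_COVERAGE)
print(candidates)
```

Coverage of both code points is only a necessary condition: a font may contain U+0301 and still position it poorly over a given base letter.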

Unicode has the concept of a “named character sequence”. Informally speaking, it can be used to give some identity and “characterhood” to a sequence such as a letter followed by a combining mark, when the combination does not exist as a precomposed character. The motivation given is: “Such a generalized notation for sequences of Unicode code points is often useful in discursive text. More formally, other standards may need to refer to entities that are represented in Unicode by sequences of characters. Mapping tables may map single characters in other standards to sequences of Unicode characters, and listings of repertoire coverage for fonts or keyboards may need to reference entities that do not correspond to single Unicode code points.” However, the concept has not become very popular, and the current registry does not contain any sequences with a Cyrillic character as the base.

By saying that accented Cyrillic letters are not included in Unicode, I mean that they are missing from the character repertoire (Universal Character Set); that is, there are no code points corresponding to them. — andrew, Jan 22 '13 at 10:36
