I'm trying to format a UTF-8 free-text string to fit an exact column width on a terminal. I'm coding various truncation methods (left/middle/right) for long strings; however, when the truncation point falls in the middle of a wide character, such as an emoji, the display column count falls apart. Some form of padding is needed to fill the leftover 'half-wide' column.

Is there a suitable narrow character that indicates we have a valid Unicode character but insufficient display space to show it, as opposed to the replacement character � (U+FFFD) usually used for invalid Unicode?

Example: on a fixed-width terminal, try to fit two smiley emojis into the space that would hold 'aaa'. The first emoji takes two columns, leaving a single column in which the second one cannot fit. So I need a (preferably standardised) narrow substitute character for the second emoji/wide character, e.g. "⋮", to fill that three-column space.
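
To make the example concrete, here is a minimal Python sketch of right truncation with a filler column. It is an illustration under assumptions, not a definitive implementation: `char_width` and `fit_right` are hypothetical names, widths are approximated from the standard library's `unicodedata` tables (East Asian Width `W`/`F` counted as two columns, marks and format characters as zero), and "⋮" is merely the candidate filler mentioned above. Emoji ZWJ sequences and terminal- or font-specific widths are not handled.

```python
import unicodedata

def char_width(ch: str) -> int:
    """Approximate terminal columns for a single code point (assumption-based)."""
    # Combining marks and format characters (e.g. ZWJ) take no column of their own.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # Wide and Fullwidth characters are assumed to occupy two columns.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def fit_right(text: str, width: int, filler: str = "⋮") -> str:
    """Truncate on the right and pad so the result spans exactly `width` columns."""
    out, used = [], 0
    for ch in text:
        w = char_width(ch)
        if used + w > width:
            # The next character does not fit. If it is a wide character
            # straddling the boundary, one column is left over: fill it.
            out.append(filler * (width - used))
            used = width
            break
        out.append(ch)
        used += w
    # Pad short strings with spaces up to the requested width.
    return "".join(out) + " " * (width - used)

print(fit_right("\U0001F600\U0001F600", 3))  # first emoji kept, filler in column 3
```

The filler is assumed to be one column wide; whichever character is chosen, that would need checking on the target terminal.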

A side issue is working out where decomposed (combining) character sequences start and end (and are there combining prefixes?). It looks like the next code point has to be read to see whether it is still zero-width: e.g. for 'o' U+006F followed by the combining diaeresis U+0308, rather than the precomposed ö U+00F6, one must not stop after the plain 'o'.
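
The look-ahead described here can be sketched as follows, again in Python and only as a simplification: the hypothetical `clusters` helper attaches every following mark (general categories `Mn`, `Mc`, `Me`) to the preceding base character. The full rules are the grapheme cluster boundaries of Unicode Standard Annex #29, which also define a small "Prepend" class of characters that attach to what follows; this sketch ignores those, as well as emoji ZWJ sequences.

```python
import unicodedata

def clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    groups: list[str] = []
    for ch in text:
        # A mark belongs to the previous group, so a group is only known to be
        # complete once the next non-mark code point (or end of text) is seen.
        if groups and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

# 'o' + U+0308 stays together; the precomposed U+00F6 is a single group anyway.
print(clusters("o\u0308x"))
print(clusters("\u00f6x"))
```

Truncating on these groups rather than on individual code points avoids cutting between the plain 'o' and its diaeresis.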

Philip Oakley
  • Common practice is to use the Unicode three-dot ellipsis or an arrow, usually drawn in grey rather than black. So it is not standardized, and after checking possible names I could not find any character in the Unicode index with exactly your semantics. – Giacomo Catenazzi Dec 07 '22 at 11:29
  • @GiacomoCatenazzi I had considered the vertical ellipsis Unicode Character “⋮” (U+22EE) as one option, as it 'stands' for something, rather than the conventional horizontal ellipsis which is 'between', or 'joins'. – Philip Oakley Dec 07 '22 at 11:35
  • Please [edit] your question to provide a [mcve]. Moreover, clarify your specific problem or provide additional details to highlight exactly what you need. As currently written, it's hard to tell exactly what you're asking. – JosefZ Dec 07 '22 at 19:52
  • For decades [MS Excel has displayed `#` when a column is too narrow for a number](https://support.microsoft.com/en-us/office/how-to-correct-a-error-bf801d0a-2a6e-44bd-a70e-0f780ae8f11e), on the basis that showing that is better than letting the reader misread an incomplete number. If you mean **combining** characters, then there's no limit for those - [read more](https://stackoverflow.com/q/10414864/4299358). – AmigoJack Dec 08 '22 at 00:46
  • @JosefZ The problem is that it isn't "producible": on a fixed-width terminal, two smiley emojis cannot fit into the space that would hold 'aaa', so I need a (preferably standardised) substitute character for the second emoji/wide character, e.g. "⋮". – Philip Oakley Dec 09 '22 at 12:27
  • For me, `"'aaaa'"; "''"` shows the same _optical_ width in Windows terminal (font `Cascadia Mono`, ratio `2:1`) while in *Windows PowerShell ISE* (font `Unifont`) I need to add one more `a` as follows `"'aaaaa'"; "''"` (however in ISE the _optical_ width is rather different, approximately `13:5`). – JosefZ Dec 09 '22 at 15:26
  • I would not expect such a character to exist in Unicode. I would expect it to be considered a visual representation issue, which is explicitly not part of code points. The width of a character is a font-specific issue, which Unicode does not address, so a code point for "there wasn't enough space to lay out the font glyph" doesn't fit Unicode's goals. (BTW, if you're looking for interesting characters to use to test your layout engine with fixed-width fonts, one I always include is U+12219, which is not always a multiple of the "fixed" character width.) – Rob Napier Jan 15 '23 at 04:01
  • @RobNapier The widths are, as best I understand, indicated in Unicode, e.g. http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt where ";W" marks the wide chars. https://github.com/depp provides part of the decoding (inc. all the combining chars), and it's then used in [Git](https://github.com/git/git/commit/9c94389c3ee02), so lots of magic. – Philip Oakley Jan 16 '23 at 17:26
  • Rob's is https://www.compart.com/en/unicode/U+12219. – Philip Oakley Jan 16 '23 at 17:28
  • The East Asian Width feature (https://www.unicode.org/reports/tr11/) is designed to interoperate with legacy East Asian character sets. It's not reliable outside of that context, and doesn't suggest that Unicode means to encode layout issues. (Many things exist for backward compatibility in Unicode.) For example, is Neutral width, as are Å, DŽ, and ﷽ (most characters are neutral in fact), but A is Narrow. ₩ is "half-width" (though ¥ is narrow). Emoji are generally wide for historical reasons (they come out of Japanese encodings, so even new ones get categorized). – Rob Napier Jan 16 '23 at 19:20
