Fully correct Unicode visual string reversal

Question

[Inspired largely by trying to explain the problems with Character Encoding independent character swap, but also these other questions neither of which contain a complete answer: How to reverse a Unicode string, How to get a reversed String (unicode safe)]

Doing a visual string reversal in Unicode is much harder than it looks. In any storage format other than UTF-32 you have to pay attention to codepoint boundaries rather than going byte-by-byte. But that's not good enough, because of combining glyphs; the spec has a concept of "grapheme cluster" that's closer to the basic unit you want to be reversing. But that's still not good enough; there are all sorts of special case characters, like bidi overrides and final forms, that will have to be fixed up.

This pseudo-algorithm handles all the easy cases I know about:

Segment the string into an alternating list of words and word-separators (some word-separators may be the empty string)

Reverse the order of this list.

For each string in the list:

Segment the string into grapheme clusters.

Reverse the order of the grapheme clusters.

Check the initial and final cluster in the reversed sequence; their base characters may need to be reassigned to the correct form (e.g. if U+05DB HEBREW LETTER KAF is now at the end of the sequence it needs to become U+05DA HEBREW LETTER FINAL KAF, and vice versa)

Join the sequence back into a string.

Recombine the list of reversed words to produce the final reversed string.

... But it doesn't handle bidi overrides and I'm sure there's stuff I don't know about, as well. Can anyone fill in the gaps?

The question is "what is missing from the pseudocode, and how does one handle bidirectional overrides?" I'll try to make it clearer. — zwol, May 19 '13 at 00:48
You have to run the bidi algorithm to transform the string to the visual order, then force the order with bidi-overrides. It is not necessarily correct to change a Hebrew letter to its final form as they are usually not used in non-words (abbreviations, numbers). I don't even want to think about reversing an Indic script. — n. m. could be an AI, May 19 '13 at 03:39
I think requirements need to be clearer. You describe the problem as "*visual* string reversal", but then include *semantic* changes like handling the final forms. Which is it? A merely visual reversal would not arbitrarily change the visual aspect of a character. — R. Martinho Fernandes, Oct 29 '13 at 09:17
I'm not sure how to be clearer, since I am not familiar with all of the languages involved. What I want is the closest practical approximation to "what someone fluent in all the relevant languages would write if asked to write a string backward." — zwol, Oct 29 '13 at 15:26

Fully correct Unicode visual string reversal

0 Answers0

Linked