2

[Inspired largely by trying to explain the problems with Character Encoding independent character swap, but also these other questions neither of which contain a complete answer: How to reverse a Unicode string, How to get a reversed String (unicode safe)]

Doing a visual string reversal in Unicode is much harder than it looks. In any storage format other than UTF-32 you have to pay attention to codepoint boundaries rather than going byte-by-byte. But that's not good enough, because of combining glyphs; the spec has a concept of "grapheme cluster" that's closer to the basic unit you want to be reversing. But that's still not good enough; there are all sorts of special case characters, like bidi overrides and final forms, that will have to be fixed up.

This pseudo-algorithm handles all the easy cases I know about:

  1. Segment the string into an alternating list of words and word-separators (some word-separators may be the empty string)
  2. Reverse the order of this list.
  3. For each string in the list:
    1. Segment the string into grapheme clusters.
    2. Reverse the order of the grapheme clusters.
    3. Check the initial and final cluster in the reversed sequence; their base characters may need to be reassigned to the correct form (e.g. if U+05DB HEBREW LETTER KAF is now at the end of the sequence it needs to become U+05DA HEBREW LETTER FINAL KAF, and vice versa)
    4. Join the sequence back into a string.
  4. Recombine the list of reversed words to produce the final reversed string.

... But it doesn't handle bidi overrides and I'm sure there's stuff I don't know about, as well. Can anyone fill in the gaps?

Community
  • 1
  • 1
zwol
  • 135,547
  • 38
  • 252
  • 361
  • 1
    ICU has a grapheme iterator. – Kerrek SB May 18 '13 at 21:44
  • Or try [Ogonek](http://flamingdangerzone.com/ogonek/). – Kerrek SB May 18 '13 at 22:01
  • What is the question? – David Heffernan May 18 '13 at 22:13
  • The question is "what is missing from the pseudocode, and how does one handle bidirectional overrides?" I'll try to make it clearer. – zwol May 19 '13 at 00:48
  • You have to run the bidi algorithm to transform the string to the visual order, then force the order with bidi-overrides. It is not necessarily correct to change a Hebrew letter to its final form as they are usually not used in non-words (abbreviations, numbers). I don't even want to think about reversing an Indic script. – n. m. could be an AI May 19 '13 at 03:39
  • I think requirements need to be clearer. You describe the problem as "*visual* string reversal", but then include *semantic* changes like handling the final forms. Which is it? A merely visual reversal would not arbitrarily change the visual aspect of a character. – R. Martinho Fernandes Oct 29 '13 at 09:17
  • 1
    I'm not sure how to be clearer, since I am not familiar with all of the languages involved. What I want is the closest practical approximation to "what someone fluent in all the relevant languages would write if asked to write a string backward." – zwol Oct 29 '13 at 15:26

0 Answers0