20

Let us suppose we want to revert the following String "áe".

The unicode for that is "\u0061\u0301\u0065".

The naive aproach of reverting it would be char by char

private static String reverseStringNaive(String s) {
    char[] characters = new char[s.length()];
    for (int i = s.length() - 1; i >= 0; i--) {
        int j = s.length() - i - 1;
        characters[j] = s.charAt(i); 
    }
    return new String(characters);
}

which gives us "éa"(\u0065\u0301\u0061) when we hope to get "eá" (\u0065\u0061\u0301). The accute accent "´" should stick together with the "a", not change to the "e".

The following code gives me the expected result for that String:

private static String reverseString(String s) {
    char[] characters = new char[s.length()];
    for (int i = s.length() - 1; i >= 0; i--) {
        int j = s.length() - i - 1;
        if (Character.isLetterOrDigit(s.charAt(i)) || Character.isISOControl(s.charAt(i))) {
            characters[j] = s.charAt(i); 
        } else {
            characters[j] = s.charAt(i-1);
            characters[j+1] = s.charAt(i);
            i--;
        }
    }
    return new String(characters);
}

I'm checking if each character is Letter, Digit or ISO Control. If not, I'm assuming it should stick together with the previous character.

The question is, are there other things I should check or worry about? Is my aproach still naive?

pablosaraiva
  • 2,343
  • 1
  • 27
  • 38
  • What about http://commons.apache.org/lang/api-2.5/org/apache/commons/lang/StringUtils.html#reverse(java.lang.String) ? –  Sep 19 '11 at 19:59
  • It does the naive aproach. Gives the wrong result. – pablosaraiva Sep 19 '11 at 20:00
  • According to javadoc, it use a `StringBuffer.reverse()` and this should work see http://download.oracle.com/javase/1.5.0/docs/api/java/lang/StringBuffer.html#reverse() –  Sep 19 '11 at 20:03
  • I tried it already. Does not work. Gives the wrong result. – pablosaraiva Sep 19 '11 at 20:08
  • 2
    Why would anybody want to do that? :-) – KarlP Sep 19 '11 at 20:25
  • 1
    Would a change in normalization form be tolerated? – Dilum Ranatunga Sep 19 '11 at 20:32
  • 4
    You must never use `charAt`. It’s broken. **YOU JUST ILLEGALLY REVERSED ALL YOUR SURROGATES!!!** – tchrist Sep 19 '11 at 21:41
  • @tchrist How to revert it without breaking it? – pablosaraiva Sep 19 '11 at 21:53
  • 2
    @pablosaraiva You have to step through it by grapheme boundaries, not code point boundaries and certainly not code unit boundaries. It looks like `icu.text.BreakIterator.getCharacterInstance` can be used to build something, but I haven’t tried. It’s much easier in Perl with its support for extended grapheme clusters: `reverse /(\X)/g` is all it takes. If you don’t do that, you can also invert CRLF combos, enough error. – tchrist Sep 19 '11 at 22:51

1 Answers1

4

Your issue could also be resolved by converting the string into the canonical decomposition form NFC. Basically, the java.text.Normalizer class can be used to combine accents and other combining characters with their base characters so you will be able to reverse properly.

All these other ideas (String.reverse(), StringBuffer.reverse()) will correctly reverse the characters in your buffer, but if you start with decomposed characters, you might not get what you expect :).

In some "decomposition forms", accent characters are stored separate from their base forms (as separate characters), but in "combined" form they are not. So in one form "áe" is stored as three characters, and in the other, combined form, as two.

However, such normalization isn't sufficient for handling other kinds of character combination, nor can it account for characters in the Unicode astral planes, which are stored as two characters (or more?) in Java.

Thanks to tchrist for pointing out the ICU support for text segmentation including extended grapheme clusters such as the one identified in the comments below (see virama). This resource seems to be the authoritative source of information on this kind of stuff.

Mike Sokolov
  • 6,914
  • 2
  • 23
  • 31
  • It is indeed a nice a aproach that works for the input I gave you, but it fails to the following string: सरस्वती. – pablosaraiva Sep 19 '11 at 20:25
  • cool - can you post the code point sequence of that? Maybe these are surrogate pairs? – Mike Sokolov Sep 19 '11 at 20:27
  • 1
    Mike, I would have thought normalization to NFC would be the way to go, so each surrogate pair has the best chance of representing the intended abstract character... – Dilum Ranatunga Sep 19 '11 at 20:34
  • The unicode codepoints for सरस्वती is \u0938\u0930\u0938\u094d\u0935\u0924\u0940. – pablosaraiva Sep 19 '11 at 20:35
  • and how should it look reversed would you say? My knowledge of that script is - um - limited – Mike Sokolov Sep 19 '11 at 21:05
  • From what I can understand, that character sequence includes a "virama" which acts as a kind of ligature tying together its two neighboring characters into a single entity. In a situation like this I'm not at all sure there is a sensible way of reversing the characters. At any rate you would have to supply some specific knowledge of the script that goes beyond what is available in typical Unicode utilities in order to avoid breaking up this character cluster. – Mike Sokolov Sep 19 '11 at 21:22
  • The only way to do this is to grab an extended grapheme cluster at a time and reverse those. – tchrist Sep 19 '11 at 21:42
  • @tchrist how to get the extend grapheme cluster? – pablosaraiva Sep 19 '11 at 21:45
  • Yes, that sounds right - but are you aware of any commonly-available utilities for text segmentation that would identify extended grapheme clusters? I don't know of any myself. – Mike Sokolov Sep 19 '11 at 21:45
  • 1
    @Mike: You have to use ICU. OraSun can’t do it. Also, you cannot use `String.reverse()`, `StringBuffer.reverse()` because you break all the surrogates. Try flipping `MATHEMATICAL FRAKTUR CAPITAL A`, `COMBINING DOT ABOVE`, `MATHEMATICAL FRAKTUR CAPITAL B`: U+1D504 U+0307 U+1D505. The right answer is `MATHEMATICAL FRAKTUR CAPITAL B`, `MATHEMATICAL FRAKTUR CAPITAL A`, `COMBINING DOT ABOVE`: U+1D505 U+1D504 U+0307. The naïve way flips surrogates and creates illegal noncharacter sequences. – tchrist Sep 19 '11 at 21:46
  • @tchrist Nice example! Well I guess for various subsets of all characters, some less-than-perfect approaches may be usable in some circumstances, but in general, yes, I see that this is a complicated problem! – Mike Sokolov Sep 19 '11 at 21:56
  • 3
    Normalization doesn't sound right: Turning base+modifier pairs into single characters is decidedly a legacy operation, and in general there *aren't* any compound characters for a given sequence of modifiers -- that's the whole point of separating the base character from the modifiers. So I would abandon this approach entirely and rather look for a proper Unicode library that understands combining characters. – Kerrek SB Sep 19 '11 at 21:57
  • 3
    @Mike: It looks like you need to use the `getCharacterInstance` method from `icu.text.BreakInterator`. Gosh, what a terrible pain. I’m use to being able to just do `reverse /(\X)/g` in Perl. – tchrist Sep 19 '11 at 22:02