So I saw Jon Skeet's video, and there was a code sample:

There should have been a problem with the é after reversing, but I guess that only fails on .NET 2 (IMHO). Anyway, it worked for me and I saw the correctly reversed string.

char[] a="Les Misérables".ToCharArray();
Array.Reverse(a);
string n= new string(a);
Console.WriteLine (n); //selbarésiM seL

But I took it further:

In Hebrew there is the "Alef" char: א

and I can add a vowel point (niqqud) to it, like: אֳ (which I believe consists of 2 chars, yet is displayed as one).

But now look what happens :

char[] a="Les Misאֳrables".ToCharArray();
Array.Reverse(a);
string n= new string(a);
Console.WriteLine (n); //selbarֳאsiM seL

There was a split: the combining mark was separated from its א and now follows the r.

I can understand why it is happening:

Console.WriteLine ("אֳ".Length); //2

So I was wondering if there's a workaround for this kind of issue in C#, or should I build my own mechanism?

Soner Gönül
Royi Namir
  • [TextElementEnumerator](http://msdn.microsoft.com/en-us/library/system.globalization.textelementenumerator.aspx) might be useful here. – Michael Liu Feb 22 '13 at 16:55
  • So `א` is two `chars`? – Jodrell Feb 22 '13 at 17:00
  • You should add this as answer Michael. Was just writing there's no such thing in .NET... Good job. – Nikola Radosavljević Feb 22 '13 at 17:02
  • @Jodrell, actually, if you paste Misאֳrables in the Visual Studio editor, set the cursor to the left of the אֳ character, and press the right arrow key, you will see it goes directly to the a, instead of the r. It also has problems going left (it stays blocked for one keypress)... – Simon Mourier Feb 22 '13 at 17:28
  • On the _Misérables_ thing: With accented letters such as the French `é` there might be two or more ways to encode them in Unicode. Either as **one** single code point which holds the entire accented letter, or as a simple `e` followed by one or more non-spacing "combining" accent characters. So if you want to get a "problematic" `Les Misérables` string, start with either `string m1 = "Les Misérables".Normalize(NormalizationForm.FormD);` or `string m2 = "Les Mise\u0301rables";`. Before reversing, the accent is over the `e`. After careless reversing, the accent goes over the `r`, that is `ŕ`. – Jeppe Stig Nielsen Jul 23 '13 at 15:10
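The two encodings described in the last comment can be checked directly. This is only a small sketch: the class name `NormalizationDemo` is hypothetical, and the é is written as the escape `\u00E9` so the result does not depend on how the source file was saved.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class NormalizationDemo
{
    public static void Main()
    {
        // é as one precomposed code point (U+00E9)
        string composed = "Les Mis\u00E9rables";
        // FormD splits it into 'e' followed by the combining acute accent U+0301
        string decomposed = composed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(composed.Length);                                  // 14
        Console.WriteLine(decomposed.Length);                                // 15
        Console.WriteLine(decomposed == "Les Mise\u0301rables");             // True
        // Counted as text elements, the decomposed string still has 14 units
        Console.WriteLine(new StringInfo(decomposed).LengthInTextElements);  // 14
    }
}
```

The extra `char` in the decomposed form is what a plain `Array.Reverse` moves away from its base letter.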

2 Answers

The problem is that Array.Reverse isn't aware that certain sequences of char values may combine to form a single character, or "grapheme", and thus shouldn't be split apart or reordered. You have to use something that understands Unicode combining character sequences, like TextElementEnumerator:

// using System.Collections.Generic;
// using System.Globalization;

TextElementEnumerator enumerator =
    StringInfo.GetTextElementEnumerator("Les Misאֳrables");

List<string> elements = new List<string>();
while (enumerator.MoveNext())
    elements.Add(enumerator.GetTextElement());

elements.Reverse();
string reversed = string.Concat(elements);  // selbarאֳsiM seL
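
For reuse, the loop above can be wrapped in a helper and compared with the naive reversal. This is a minimal sketch, not a definitive implementation: the class name `TextElementDemo` and the method names `ReverseNaive` and `ReverseByTextElements` are hypothetical, and the test string uses the decomposed `e\u0301` form from the comments so that the difference shows up even with a Latin letter.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Reverses char by char; a combining mark ends up after a different base char.
    public static string ReverseNaive(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // Reverses text element by text element, so each base character
    // stays together with the combining marks that follow it.
    public static string ReverseByTextElements(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            elements.Add(e.GetTextElement());
        elements.Reverse();
        return string.Concat(elements);
    }

    public static void Main()
    {
        string s = "Les Mise\u0301rables";
        // The accent lands on the 'r' in the naive version...
        Console.WriteLine(ReverseNaive(s) == "selbar\u0301esiM seL");          // True
        // ...and stays on the 'e' when reversing by text element.
        Console.WriteLine(ReverseByTextElements(s) == "selbare\u0301siM seL"); // True
    }
}
```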
Michael Liu
If you define the extension method

public static IEnumerable<string> ToTextElements(this string source)
{
    var e = StringInfo.GetTextElementEnumerator(source);
    while (e.MoveNext())
    {
        yield return e.GetTextElement();
    }
}

you could then write (with `using System.Linq;` for `Reverse()`):

const string a = "AnyStringYouLike";
var aReversed = string.Concat(a.ToTextElements().Reverse());
Jodrell