I have this code in Java and it works fine:

    String a = "ABC";
    System.out.println(a.length());
    for (int n = 0; n < a.length(); n++)
        System.out.println(a.codePointAt(n));

The output, as expected, is 3 65 66 67. I am a little confused about a.length(), because it is supposed to return the length in chars, yet String must store every character below 256 in 16 bits, or in however many bits a Unicode character needs.

But the question is: how can I do the same in C#? I need to scan a string and act depending on certain Unicode characters found.

The real code I need to translate is

    String str = this.getString();
    int cp;
    boolean escaping = false;
    for (int n = 0; n < len; n++)
    {
        //===================================================
        cp = str.codePointAt(n); //LOOKING FOR SOME EQUIVALENT IN C#
        //===================================================
        if (!escaping)
        {
          ....

       //Closing all braces below.

Thanks in advance.

How much I love Java :). I just need to deliver a Windows app that is a client of a Java / Linux app server.

mdev

2 Answers

The direct translation would be this:

    string a = "ABC⤶"; // Let's throw in a rare Unicode char
    Console.WriteLine(a.Length);
    for (int n = 0; n < a.Length; n++)
        Console.WriteLine((int)a[n]); // a[n] returns a char, which we can cast to an int
    // Final result: 4 65 66 67 10550

In C# you don't need codePointAt at all: you can get the Unicode number directly by casting the character to an int (in an assignment, the conversion is implicit). So you can get your cp simply by doing

    cp = (int)str[n];

How much I love C# :)

However, this is valid only for low Unicode values (the Basic Multilingual Plane, up to U+FFFF). Surrogate pairs are handled as two separate chars when you index into the string, so they won't be printed as one value. If you really need full UTF-32 code points, you can refer to this answer, which basically uses

    int cp = Char.ConvertToUtf32(a, n);

and then advances the loop index by two instead of one (because the code point is encoded as two chars) whenever Char.IsSurrogatePair() returns true.

Your translation would then become

    // Needs "using System.Linq;" for Count()
    string a = "ABC\U0001F01C";
    Console.WriteLine(a.Count(x => !char.IsHighSurrogate(x)));
    for (var i = 0; i < a.Length; i += char.IsSurrogatePair(a, i) ? 2 : 1)
        Console.WriteLine(char.ConvertToUtf32(a, i));

Please note the change from a.Length to a little bit of LINQ for the count, because a surrogate pair is counted as two chars. We simply count how many chars are not high surrogates to get the number of actual characters.
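For comparison on the Java side, here is a minimal sketch of the same idea: count code points instead of chars, and step over both halves of a surrogate pair while scanning. The class and method names (`CodePointScan`, `countCodePoints`, `codePoints`) are made up for illustration and are not part of the question's code.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointScan {
    // Number of Unicode code points; a surrogate pair counts once.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Collects each code point, skipping the second half of a surrogate pair.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 2 for supplementary characters, else 1
        }
        return result;
    }

    public static void main(String[] args) {
        String a = "ABC\uD83C\uDC1C"; // "ABC" followed by U+1F01C as a surrogate pair
        System.out.println(a.length());         // 5 (UTF-16 code units)
        System.out.println(countCodePoints(a)); // 4 (code points)
        System.out.println(codePoints(a));      // [65, 66, 67, 127004]
    }
}
```

A plain n++ loop over codePointAt(n) would also land on the low surrogate and report it as a separate value, which is why the index advances by Character.charCount(cp) here.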

Pierre-Luc Pineault
  • @SotiriosDelimanolis Because if you don't you're printing the char directly and not the unicode number? – Pierre-Luc Pineault May 20 '14 at 05:06
  • So for printing purposes. If the underlying value is a `char`, how is this equivalent to Java which returns surrogates (`int`) with value greater than can fit in a `char`? – Sotirios Delimanolis May 20 '14 at 05:08
  • @SotiriosDelimanolis Yeah, I just verified that: it is indeed possible to assign without the cast, and ReSharper will flag the cast as redundant. – Pierre-Luc Pineault May 20 '14 at 05:10
  • Thanks for the answer. I am starting to like C# a little more :). No doubt it is the best if you are going to run on Windows. – mdev May 20 '14 at 05:15
  • Can you please address my previous comment? I'm still not convinced this is equivalent to `codePointAt`. – Sotirios Delimanolis May 20 '14 at 05:16
  • @SotiriosDelimanolis Looks like it'll be printed as two different chars, which won't work as the OP expects for surrogate pairs. It is correct for low Unicode values though. – Pierre-Luc Pineault May 20 '14 at 05:24
  • @mdev This solution is not equivalent to `codePointAt`. You might want to review. – Sotirios Delimanolis May 20 '14 at 05:27
  • The last part is correct and I liked it, but the first part wasn't needed at all and, as you mentioned, it won't work for non-BMP characters. `cp = (int)str[n];` is exactly the same as `cp = (int)str.charAt(n);` in Java, but not `codePointAt()`. Moreover, as you can tell, you don't even need the cast in most cases if a string consists of BMP characters only, both in Java and C#. In the first place, the OP's Java code was wrong and that made you misunderstand it, I guess. – Jenix Sep 04 '18 at 23:49

The following code prints the code point of each character in a string, treating a surrogate pair as a single code point:

    var s = "\uD834\uDD61";
    for (var i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        var codepoint = char.ConvertToUtf32(s, i);
        Console.WriteLine("U+{0:X4}", codepoint);
    }
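
For anyone checking the result against the original Java, a rough counterpart might look like the sketch below. It assumes Java 8 or later for `String.codePoints()`; `CodePointHex` and `describe` are hypothetical names used only for this example.

```java
import java.util.stream.Collectors;

class CodePointHex {
    // Formats every code point of s as U+XXXX (at least four hex digits),
    // separated by single spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // One surrogate pair, so a single code point is printed: U+1D161
        System.out.println(describe("\uD834\uDD61"));
    }
}
```

The stream returned by `codePoints()` already combines surrogate pairs, so no manual index stepping is needed in this variant.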
musium