
I'm running into a lot of string-indexing issues going from C# to Python. The existing data pipeline (in C#) generates string indices for a Python model to consume. The problem is that the two languages count strings in different units (UTF-16 code units in C#, code points in Python 3), as summarized here: http://illegalargumentexception.blogspot.com/2010/04/i18n-comparing-character-encoding-in-c.html

Hence, string lengths and indices from C# (16-bit, implicit UTF-16) do not always carry over to Python (16- or 32-bit, depending on the build). Python sometimes reports a smaller string length than C# when a character is above 0xFFFF (that is, needs more than 16 bits).
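
A minimal sketch of the mismatch, assuming Python 3, where `len` counts code points. The sample string and the helper name `utf16_len` are illustrative only, not part of any library:

```python
def utf16_len(s: str) -> int:
    """Number of UTF-16 code units in s (what C#'s string.Length reports)."""
    # "utf-16-le" writes no byte-order mark, so every 2 bytes is one code unit.
    return len(s.encode("utf-16-le")) // 2


# Hypothetical sample: U+10911 lies above U+FFFF, so UTF-16 needs a
# surrogate pair (two code units) to represent it.
sample = "a\U00010911b"

print(len(sample))        # code points, as Python 3 counts them
print(utf16_len(sample))  # UTF-16 code units, as C# counts them
```

The two numbers differ by one for every character above U+FFFF in the string.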

The question is: is there any way to make sure the string indices and lengths are identical? Is it possible to force Python, say, to use an implicit 16-bit representation as C# does?

A concrete example is this:

𐤑𐤅𐤓, Ṣur

And its utf-8 bytes:

b'\xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur'

In Python, the length of this string is 12, whereas C# reports 15. Indices will also be off from one language to the other.

Yo Hsiao
  • That's not a string in Python: it's a sequence of raw bytes. – Jonathon Reinhart Dec 19 '17 at 03:07
  • It is the string "𐤑𐤅𐤓, Ṣur" in UTF-8 encoding. Since copying and pasting the characters may not be reproducible, I pasted the bytes for investigation. – Yo Hsiao Dec 19 '17 at 03:14
  • If you call `.decode('utf-8')` then you will have a string. But what you've shown is not a string. – Jonathon Reinhart Dec 19 '17 at 03:16
  • @JonathonReinhart I updated the question with the original text for clarification. Regardless, Python and C# report different lengths and use different indices for the string. – Yo Hsiao Dec 19 '17 at 03:18
  • You may want to look at using the StringInfo class in C#. This class is designed to allow inspection of a string's individual graphemes rather than individual UTF-16 code points which may be grouped into a single visual "character". I'm not sure if there's something similar in Python. – Mike Zboray Dec 19 '17 at 03:24

1 Answer


You likely want to use the StringInfo class per this answer here: Why is the length of this string longer than the number of characters in it?

using System;
using System.Text;
using System.Globalization;

namespace StackOverflow {
    class Program {
        public static void Main(string[] args) {
            var s = "𐤑𐤅𐤓, Ṣur";
            // s.Length == 11 (UTF-16 code units)
            Console.WriteLine("{0}: {1}", s, s.Length);

            // LengthInTextElements == 8 (text elements)
            var si = new StringInfo(s);
            Console.WriteLine("{0}: {1}", s, si.LengthInTextElements);
        }
    }
}

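Or, on the Python side, you can count UTF-16 code units directly: encode the string as UTF-16, divide the byte length by 2, and subtract 1 for the byte-order mark that the "utf-16" codec prepends. Characters above U+FFFF are encoded as surrogate pairs (4 bytes), so each one counts as two units, just as in C#'s Length: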

#!/usr/bin/env python3

s = "𐤑𐤅𐤓, Ṣur"
# len == 8 (code points)
print("{}: {}".format(s, len(s)))

# len == 11 (UTF-16 code units, as C# counts them)
# The "utf-16" codec prepends a 2-byte BOM, hence the - 1.
print(len(s.encode("utf-16")) // 2 - 1)
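
For indices rather than lengths, one option is to convert offsets explicitly at the boundary between the two programs, in either direction. The sketch below assumes Python 3 and that the C# side reports plain UTF-16 code-unit offsets. The function names `utf16_index_to_py` and `py_index_to_utf16` are made up for illustration:

```python
def _units(ch: str) -> int:
    """UTF-16 code units needed for one code point (2 above U+FFFF)."""
    return 2 if ord(ch) > 0xFFFF else 1


def utf16_index_to_py(s: str, utf16_index: int) -> int:
    """Map a UTF-16 code-unit offset (C# index) to a Python str index.

    An offset that falls inside a surrogate pair maps to the next character.
    """
    units = 0
    for i, ch in enumerate(s):
        if units >= utf16_index:
            return i
        units += _units(ch)
    return len(s)


def py_index_to_utf16(s: str, py_index: int) -> int:
    """Map a Python str index to the UTF-16 code-unit offset C# would use."""
    return sum(_units(ch) for ch in s[:py_index])


# The example string, written with escapes so it survives copy and paste.
s = "\U00010911\U00010905\U00010913, \u1e62ur"

print(utf16_index_to_py(s, 6))  # Python index of the comma
print(py_index_to_utf16(s, 3))  # C# index of the comma
```

Both helpers walk the string, so they are linear in its length. For many lookups on the same string it would probably be cheaper to build the offset table once.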
mattmc3
  • Thanks! This solves the direction from C# to Python. Do you have any suggestion for the other direction from Python to C#? – Yo Hsiao Dec 19 '17 at 05:26
  • Just out of curiosity: When you have combining diacritics, will StringInfo count them as separate characters? If not, this will again differ from how Python counts characters... – lenz Dec 19 '17 at 09:09
  • @YoHsiao - I added a semi-equivalent UTF-16 length example for Python. – mattmc3 Dec 19 '17 at 14:50
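
On the combining-diacritics question, a small Python 3 sketch shows that `len` counts a combining mark as its own code point, whereas StringInfo groups a base character with its combining marks into one text element. The helper `approx_text_elements` is a hypothetical name and only a rough approximation: it skips characters with a non-zero canonical combining class and does not implement full grapheme-cluster segmentation.

```python
import unicodedata

precomposed = "\u1e62"  # Ṣ as a single code point
decomposed = "S\u0323"  # S followed by COMBINING DOT BELOW


def approx_text_elements(s: str) -> int:
    """Rough count of text elements: code points that are not combining marks."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


print(len(precomposed))                                # one code point
print(len(decomposed))                                 # two code points
print(len(unicodedata.normalize("NFC", decomposed)))   # composed back to one
print(approx_text_elements(decomposed))                # one base character
```

Normalizing both sides to the same form (for example NFC) before computing indices removes this particular source of disagreement, though not every sequence has a precomposed form.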