Can I use different codepoints in Python3?

Question

I'm running into a lot of string-indexing issues when passing data from C# to Python. The existing data pipeline (in C#) generates string indices for a Python model to consume. The problem is that the two languages index their Unicode strings in different units, as summarized here: http://illegalargumentexception.blogspot.com/2010/04/i18n-comparing-character-encoding-in-c.html

Hence, string lengths and indices from C# (16-bit, implicit UTF-16) do not always carry over to Python (16 or 32 bits). Python sometimes reports a shorter length than C# when the string contains a character above 0xFFFF (one that needs more than 16 bits).

The question is: is there any way to make sure the string indices and lengths are identical? Is it possible to force Python to use, say, implicit 16-bit units as C# does?

A concrete example is this:

𐤑𐤅𐤓, Ṣur

And its utf-8 bytes:

b'\xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur'

In Python, the length of this string is 12, whereas C# reports 15. Indexing is also off between the two languages.

That's not a string in Python: it's a sequence of raw bytes. — Jonathon Reinhart, Dec 19 '17 at 03:07
It is the string "𐤑𐤅𐤓, Ṣur" in UTF-8 encoding. Since copying and pasting the characters may not be reproducible, I pasted the bytes for investigation. — Yo Hsiao, Dec 19 '17 at 03:14
If you call `.decode('utf-8')` then you will have a string. But what you've shown is not a string. — Jonathon Reinhart, Dec 19 '17 at 03:16
@JonathonReinhart I updated the question with the original text for clarification. Regardless, Python and C# report different lengths and use different indices for the string. — Yo Hsiao, Dec 19 '17 at 03:18
You may want to look at using the StringInfo class in C#. This class is designed to allow inspection of a string's individual graphemes rather than individual UTF-16 code units, several of which may be grouped into a single visual "character". I'm not sure if there's something similar in Python. — Mike Zboray, Dec 19 '17 at 03:24

mattmc3 · Accepted Answer · 2017-12-19T15:25:04.370

You likely want to use the StringInfo class per this answer here: Why is the length of this string longer than the number of characters in it?

using System;
using System.Text;
using System.Globalization;

namespace StackOverflow {
    class Program {
        public static void Main(string[] args) {
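            var s = "𐤑𐤅𐤓, Ṣur";
            // Length == 11: counts UTF-16 code units (each Phoenician letter is a surrogate pair)
            Console.WriteLine("{0}: {1}", s, s.Length);

            // LengthInTextElements == 8: counts text elements
            var si = new StringInfo(s);
            Console.WriteLine("{0}: {1}", s, si.LengthInTextElements);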
        }
    }
}

Or, on the Python side, you can count UTF-16 code units by encoding the string and dividing the byte length by 2. Characters above U+FFFF encode as surrogate pairs (two code units each), so the result matches C#'s Length:

#!/usr/bin/env python3

s = "𐤑𐤅𐤓, Ṣur"
# len == 8 (code points)
print("{}: {}".format(s, len(s)))

# len == 11 (UTF-16 code units, as in C#); "utf-16-le" writes no BOM
print(len(s.encode("utf-16-le")) // 2)
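For indices rather than lengths, a similar idea can translate positions between the two conventions. The following is only a minimal sketch: the function names `utf16_offset_to_index` and `index_to_utf16_offset` are hypothetical helpers, not part of any library, and it assumes a Python 3 build where `str` is indexed by code point (3.3 or later). An offset that falls in the middle of a surrogate pair is rounded up to the next character.

```python
def utf16_offset_to_index(s: str, utf16_offset: int) -> int:
    """Map a C#-style offset (UTF-16 code units) to a Python str index."""
    units = 0
    for index, ch in enumerate(s):
        if units >= utf16_offset:
            return index
        # Code points above U+FFFF take two UTF-16 code units (a surrogate pair).
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(s)


def index_to_utf16_offset(s: str, index: int) -> int:
    """Map a Python str index (code points) to a C#-style UTF-16 offset."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s[:index])


# The example string, written with escapes so it survives copy and paste.
s = "\U00010911\U00010905\U00010913, \u1E62ur"

print(utf16_offset_to_index(s, 8))  # 5: C# offset 8 is the "Ṣ"
print(index_to_utf16_offset(s, 5))  # 8: and back again
```

Both helpers scan the string from the start on every call. If many offsets need converting for the same string, precomputing a lookup table once would likely be cheaper.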

edited Dec 19 '17 at 15:25

answered Dec 19 '17 at 03:37


Thanks! This solves the direction from C# to Python. Do you have any suggestion for the other direction from Python to C#? – Yo Hsiao Dec 19 '17 at 05:26
Just out of curiosity: When you have combining diacritics, will StringInfo count them as separate characters? If not, this will again differ from how Python counts characters... – lenz Dec 19 '17 at 09:09
@YoHsiao - I added a semi-equivalent UTF-16 length example for Python. – mattmc3 Dec 19 '17 at 14:50
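
On the combining-diacritics point, a small sketch of what Python does: `len` counts code points, so the same visible "Ṣ" has length 1 or 2 depending on its normalization form. As far as I know, StringInfo treats a base character plus its combining marks as one text element, so the counts can diverge for decomposed input. Normalizing to NFC on both sides before indexing is one possible way to reduce that, though not every combining sequence has a precomposed form.

```python
import unicodedata

composed = "\u1E62"  # "Ṣ" as a single precomposed code point
# NFD splits it into "S" followed by U+0323 COMBINING DOT BELOW.
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed))    # 1
print(len(decomposed))  # 2

# NFC recombines the pair into the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)  # True
```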
