I'm running into a lot of string-indexing issues when passing data from C# to Python. Basically, the existing data pipeline (in C#) generates string indices for a Python model to consume. The problem is that the two languages count string positions in different units (UTF-16 code units versus Unicode code points), as summarized here: http://illegalargumentexception.blogspot.com/2010/04/i18n-comparing-character-encoding-in-c.html
Hence, string lengths and indices in C# (16-bit, implicit UTF-16) do not always carry over to Python (16- or 32-bit, depending on the build). Python sometimes reports a smaller string length than C# when a character's code point is above 0xFFFF (i.e. it needs more than 16 bits).
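To make the mismatch concrete, here is a minimal sketch of how I compare the two counts from the Python side. It assumes Python 3.3 or later, where `len` counts code points; the helper name `utf16_length` is my own and not a library function.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, i.e. what C#'s string.Length would report."""
    # "utf-16-le" encodes without a BOM, and every code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2


# One character above U+FFFF placed between two ASCII letters.
sample = "a\U00010911b"

print(len(sample))           # code points: 3
print(utf16_length(sample))  # UTF-16 code units: 4
```

The character above U+FFFF counts once in Python but twice in UTF-16 (as a surrogate pair), so every such character shifts all later indices by one.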
The question is: is there any way to make sure the string indices and lengths are identical in both languages? Is it possible to force, say, Python to use implicit 16-bit code units, as C# does?
A concrete example is this:
𐤑𐤅𐤓, Ṣur
And its UTF-8 bytes:
b'\xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur'
In Python, the length of this string is 12, whereas C# reports 15. Indexing is also off between the two languages.
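As a stopgap, I have been sketching a conversion from C#-style offsets to Python indices, using the byte string above as input. This is only a rough workaround under my own assumptions (Python 3.3+, well-formed input with no lone surrogates); the function name `utf16_to_codepoint_index` is hypothetical, and I would still prefer a built-in way to do this.

```python
def utf16_to_codepoint_index(text: str, utf16_index: int) -> int:
    """Convert an offset in UTF-16 code units (C#-style) to a Python str index.

    An offset that falls inside a surrogate pair is rounded up to the
    next character.
    """
    units = 0
    for i, ch in enumerate(text):
        if units >= utf16_index:
            return i
        # Characters above U+FFFF occupy two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


raw = b'\xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur'
text = raw.decode("utf-8")

# In UTF-16 the comma sits at offset 6, after three surrogate pairs.
py_index = utf16_to_codepoint_index(text, 6)
print(py_index, repr(text[py_index]))
```

Each of the three leading characters takes two UTF-16 code units, which accounts for the difference of three between the two lengths. The loop is linear per lookup, so for many offsets into the same string it would make more sense to precompute the mapping once.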