utf-8 string indices in python not compatible in java

Question

I have a text file with the following content:

 \n==================\0No. 4♨ ==\n \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\nemailaddress@gmail.com

I have Python code running on the server that finds the indices, which I want to pass along with the text for highlighting on the client. This is the code:

import re
f = open('data.json', 'r')
text = f.readline().strip().decode('UTF-8').encode('UTF-8')
f.close()

for m in re.finditer(r'emailaddress', text, flags=re.IGNORECASE): 
    s = m.start()
    e = m.end()
    print s, e
    print text[s:e]

The output is:

123 135
emailaddress

Now on the client side, I have Java code (on Android). However, these indices don't work at all.

public class HelloWorld {
    public static void main(String[] args) {
        String text = "\n==================\0No. 4♨ ==\n \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\nemailaddress@gmail.com";
        System.out.println(text.substring(115));
    }
}

And the output is:

l.com

I am sure I am making some mistake in the encoding of the strings. Can someone help me with that?

score 3 · Answer 1 · edited May 23 '17 at 11:58

The Python side works with UTF-8 encoded data (where the number of bytes per character varies), while the Java code works with UTF-16 codeunits^*. Indices into one do not map onto the other.

You can see the issue when applying the index to your sample string, both as a Unicode string and encoded to UTF-8, in a Python 2.7 UCS-2 build (which uses UTF-16 surrogate pairs like Java does):

>>> u"\n==================\0No. 4♨ ==\n \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\nemailaddress@gmail.com"[115:]
u'l.com'
>>> u"\n==================\0No. 4♨ ==\n \n✅IHappy Holi\n✅Ground Floor or Second Floor\n9910080224\nemailaddress@gmail.com".encode('utf8')[115:]
'\nemailaddress@gmail.com'

UTF-8 encodes each Unicode codepoint to between 1 and 4 codeunits (bytes); how many codeunits are used depends on the text:

>>> len(u'abc'.encode('utf8'))
3
>>> len(u'åßç'.encode('utf8'))
6

When text is decoded to an internal UTF-16 representation (as Java does, and Python 2.7 with the default narrow UCS-2 build), most characters use just one codeunit, while characters outside of the BMP (like many emoji) use two:

>>> u"🔴📌✅"
u'\U0001f534\U0001f4cc\u2705'
>>> len(u"🔴📌✅")
5
>>> u"🔴📌✅".encode('utf8')
'\xf0\x9f\x94\xb4\xf0\x9f\x93\x8c\xe2\x9c\x85'
>>> len(u"🔴📌✅".encode('utf8'))
11
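
The same counts can be reproduced without a narrow build by encoding explicitly. This is only a sketch, assuming Python 3 (the session above is Python 2.7); the variable names are made up for illustration:

```python
# The three codepoints shown in the session above, written as escapes.
text = '\U0001f534\U0001f4cc\u2705'

codepoints = len(text)                            # Python 3 counts codepoints
utf16_units = len(text.encode('utf-16-le')) // 2  # 2 bytes per codeunit, no BOM
utf8_bytes = len(text.encode('utf-8'))            # 1 to 4 bytes per codepoint

print(codepoints, utf16_units, utf8_bytes)
```

The three numbers differ for the same text, which is why an index computed in one representation cannot be reused in another.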

Either run your regex on a Unicode value in Python (e.g. decode from UTF-8) or alter the Java code to operate on UTF-8 bytes rather than UTF-16 codeunits.
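
A third possibility is to keep the byte-based regex and translate each UTF-8 byte offset into a UTF-16 codeunit offset before sending it to the client. A minimal sketch, assuming Python 3; the helper name `utf8_to_utf16_offset` and the sample text are hypothetical, not taken from the question:

```python
import re


def utf8_to_utf16_offset(data: bytes, byte_offset: int) -> int:
    """Translate an offset into UTF-8 bytes to an offset in UTF-16 codeunits.

    Hypothetical helper: byte_offset must fall on a character boundary,
    which holds for the match boundaries of the ASCII pattern used here.
    """
    prefix = data[:byte_offset].decode('utf-8')
    # Each UTF-16 codeunit is 2 bytes; 'utf-16-le' adds no byte order mark.
    return len(prefix.encode('utf-16-le')) // 2


# Made-up sample text, held as UTF-8 bytes like the contents of the file.
data = '\U0001f534\u2705Happy Holi\nemailaddress@gmail.com'.encode('utf-8')

for m in re.finditer(rb'emailaddress', data, flags=re.IGNORECASE):
    start = utf8_to_utf16_offset(data, m.start())
    end = utf8_to_utf16_offset(data, m.end())
    print(m.start(), m.end(), '->', start, end)
```

The converted pair is what a Java `String.substring(start, end)` call would expect, provided the client holds exactly the same text.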

If you are using Unicode in Python, do take into account that you can also build the Python 2 binary using UCS-4 for Unicode codepoints; you'd never see surrogates, and the length of a string containing characters outside the BMP will differ from that of the Java representation. Python 3.3 and up use flexible storage where the internal representation will never use surrogates but instead scales to meet the requirements of each individual string.

In that case you may need to use JSR-204 methods to access codepoints on the Java side; I suspect that String.offsetByCodePoints() would be helpful here but I am not a Java developer.
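
The conversion from codepoint indices to UTF-16 indices can also be done on the Python side before the indices are sent. A rough sketch, assuming Python 3 (where string indices count codepoints); `codepoint_to_utf16_offset` is a hypothetical helper name and the sample text is made up:

```python
def codepoint_to_utf16_offset(text: str, index: int) -> int:
    """Return the UTF-16 codeunit offset for a codepoint index into text."""
    # Codepoints above U+FFFF need a surrogate pair, i.e. two codeunits.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:index])


text = '\U0001f534\U0001f4cc\u2705emailaddress'
start = text.index('emailaddress')                   # codepoint index
java_start = codepoint_to_utf16_offset(text, start)  # UTF-16 codeunit index
print(start, java_start)
```

This is the same arithmetic that a codepoint-aware method on the Java side would otherwise have to perform.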

You may want to brush up on Unicode and codecs; I recommend you read:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO

^* Java's String type uses UTF-16 words, which are 2 bytes per codeunit. For characters outside the BMP, that means two codeunits are used per character using surrogate pairs.
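
For illustration, the split into a surrogate pair is plain bit arithmetic. A small sketch, assuming Python 3; `to_surrogate_pair` is a hypothetical name, not a standard library function:

```python
def to_surrogate_pair(codepoint: int) -> tuple:
    """Split a codepoint above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError('codepoint is inside the BMP or out of range')
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x1F534)
print(hex(high), hex(low))
```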

@Deduplicator: further editing. Let's all switch to Python 3 where all this is hidden a little better. — Martijn Pieters, Mar 04 '15 at 16:01
Thank you much for the links. I think I have a better(?) understanding of what I am doing. However I am still facing the problem. — Pankaj Garg, Mar 04 '15 at 19:36
@mirchiseth: so the path for you isn't clear? Either encode to UTF-8 in Java (so you get a `byte[]`, then slice from there, perhaps decoding from UTF-8 afterward) or handle the string in Python as Unicode, but take into account that in Python you may then have to account for surrogate pairs (if `sys.maxunicode == 0xffff` you have a UCS-2 build), and in Java you *certainly* will have to. — Martijn Pieters, Mar 04 '15 at 19:55
I finally figured out that there is some problem in the Python version. With Python version 2.7.8 it is working perfectly fine. However, I am now having trouble fixing the versions of numpy, scipy, etc. — Pankaj Garg, Mar 04 '15 at 21:44
