How do I get a proper Java string from the Python-created string 'Oslobo\xc4\x91enja'? How do I decode it? I think I've tried everything and looked everywhere; I've been stuck on this problem for 2 days. Please help!

Here is the Python web service method that returns the JSON, which the Java client then parses with Google Gson.

def list_of_suggestions(entry):
   """Returns list of suggestions from auto-complete search"""
   input = entry.encode('utf-8')
   json_result = { 'suggestions': [] }
   resp = urllib2.urlopen('https://maps.googleapis.com/maps/api/place/autocomplete/json?input=' + urllib2.quote(input) + '&location=45.268605,19.852924&radius=3000&components=country:rs&sensor=false&key=blahblahblahblah')
   # make json object from response
   json_resp = json.loads(resp.read())

   if json_resp['status'] == u'OK':
     for pred in json_resp['predictions']:
        if pred['description'].find('Novi Sad') != -1 or pred['description'].find(u'Нови Сад') != -1:
           obj = {}
           obj['name'] = pred['description'].encode('utf-8').encode('string-escape')
           obj['reference'] = pred['reference'].encode('utf-8').encode('string-escape')
           json_result['suggestions'].append(obj)

   return str(json_result)

Here is the solution on the Java client:

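private String python2JavaStr(String pythonStr) throws UnsupportedEncodingException {
    // The escaped Python string is expected to be pure ASCII, so byte and char indexes line up.
    byte[] bytes = pythonStr.getBytes("US-ASCII");
    ByteBuffer decodedBytes = ByteBuffer.allocate(bytes.length);
    for (int i = 0; i < bytes.length; i++) {
        // Only treat "\x" as an escape when two hex digits can follow it.
        if (bytes[i] == '\\' && i + 3 < bytes.length && bytes[i + 1] == 'x') {
            // \xc4 => c4 => 196
            int charValue = Integer.parseInt(pythonStr.substring(i + 2, i + 4), 16);
            decodedBytes.put((byte) charValue);
            i += 3;
        } else {
            decodedBytes.put(bytes[i]);
        }
    }
    // Decode only the bytes actually written, not the unused tail of the buffer.
    return new String(decodedBytes.array(), 0, decodedBytes.position(), "UTF-8");
}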
Martijn Pieters
Ognjen Stanić
  • You have UTF-8 data displayed as a Python string literal; decoding that to Unicode gives `Oslobođenja`. Presumably Java can handle UTF-8 data? – Martijn Pieters Sep 03 '13 at 14:03
  • Maybe have a look at this question: http://stackoverflow.com/questions/5943152/string-decode-utf-8 – Freelancer Sep 03 '13 at 14:08
  • @MartijnPieters I get the Python string u'Oslobo\u0111enja', which Java can't handle. Just can't. Java can handle 'Oslobo\xc4\x91enja'. Now I just need to decode it into, as you guessed, Oslobođenja. – Ognjen Stanić Sep 03 '13 at 14:08
  • @Freelancer I've looked, and that is just like x + 2 - 2. Nothing changed... Thanks anyway. – Ognjen Stanić Sep 03 '13 at 14:12
  • @Ognjen: Who said anything about decoding the value in Python? You have the Unicode value, but it's the UTF-8 value that Java can handle. – Martijn Pieters Sep 03 '13 at 14:14
  • @MartijnPieters Yes, sorry, I forgot to mention. I've tried to read u'Oslobo\u0111enja' in JSON from Java and Java can't do that. So I've done str.encode('utf-8') in Python and got 'Oslobo\xc4\x91enja', which Java can read from JSON. But I don't know how to get Oslobođenja from that. – Ognjen Stanić Sep 03 '13 at 14:18
  • @Ognjen: Stick to the `json` module to produce valid JSON. `u'Oslobo\u0111enja'` is *not* JSON, that's a Python string literal. `"Oslobo\u0111enja"` *is*. – Martijn Pieters Sep 03 '13 at 14:20
  • @MartijnPieters Thanks man, but do you have any idea how to do that in Python? In Python, when parsing JSON text I get a JSON object, and every key and every (string) value is unicode. How do I get "Oslobo\u0111enja" from u'Oslobo\u0111enja'? – Ognjen Stanić Sep 03 '13 at 14:25
  • @Ognjen: What *are* you trying to do? If you are *loading* JSON in Python, then `u'Oslobo\u0111enja'` is exactly what you want. That's a valid Unicode value right there. I assumed you were *producing* JSON for some Java code to read and were struggling with the Java side. – Martijn Pieters Sep 03 '13 at 14:28
  • @MartijnPieters OK, brief explanation. In the Python web service, I first download some JSON from one of Google's services, then I make a JSON object from it (json.loads(resp)), and then I build a shortened JSON from it for my app's purposes. The Java client then reads the shortened JSON with Google Gson. Everything works, but Java doesn't decode the "Oslobo\xc4\x91enja" string correctly. Just that. – Ognjen Stanić Sep 03 '13 at 14:36
  • @MartijnPieters Or, let me be very clear: how do I get the letter 'đ' from \xc4\x91? How can I get the Unicode character from \xc4\x91 in Java? – Ognjen Stanić Sep 03 '13 at 14:41
  • @Ognjen: Can you update your question to show the code for that? Either pass Unicode values to `json.dumps()` to produce valid JSON for Java to handle, or tell `json.dumps()` with the `encoding` parameter how to decode byte strings. – Martijn Pieters Sep 03 '13 at 14:41
  • @Ognjen: `'Oslobo\xc4\x91enja'` is Python string literal notation for UTF-8 bytes. I had Python interpret that string back to a byte string value, then decoded it from UTF-8. – Martijn Pieters Sep 03 '13 at 14:42
  • @MartijnPieters OK, I will update the question in a minute. – Ognjen Stanić Sep 03 '13 at 14:45
  • @MartijnPieters Thanks, I've found a solution following your and Keith's suggestions. – Ognjen Stanić Sep 03 '13 at 15:10
  • @Ognjen: I only just now saw your code. You are returning a Python object string representation, not a JSON response. – Martijn Pieters Sep 03 '13 at 15:13
  • @MartijnPieters No, I'm returning a hash, a dictionary as a string. – Ognjen Stanić Sep 03 '13 at 15:15
  • @Ognjen: Exactly. How is Java supposed to interpret Python dictionary objects? Use actual JSON instead. – Martijn Pieters Sep 03 '13 at 15:16
  • @MartijnPieters Hmm, I tried print json_resp (which is a JSON object) and I got every key/value pair in unicode. Surely Google Gson wouldn't parse it that way. Check the question, I've updated it with the solution. Thanks for your effort and time, Martijn, I really appreciate it! Thanks, you helped me! – Ognjen Stanić Sep 03 '13 at 15:28
  • Don't put 'Solved' in the question title; instead, you can mark any one answer as accepted. – Martijn Pieters Sep 03 '13 at 15:42
  • @Ognjen: JSON uses `\u....` escape codes *as well*, and any JSON parser can handle those. It's the `u'..'` or `u"..."` strings that don't work for such parsers. – Martijn Pieters Sep 03 '13 at 15:43
  • @MartijnPieters Now everything is clear. Thanks, you are the boss! – Ognjen Stanić Sep 03 '13 at 15:48

2 Answers

2

You are returning the string representation of a Python data structure, which is not JSON.

Return an actual JSON response instead; leave the values as Unicode:

if json_resp['status'] == u'OK':
    for pred in json_resp['predictions']:
        desc = pred['description'] 
        if u'Novi Sad' in desc or u'Нови Сад' in desc:
            obj = {
                'name': pred['description'],
                'reference': pred['reference']
            }
            json_result['suggestions'].append(obj)

return json.dumps(json_result)

Now Java does not have to interpret Python escape codes, and can parse valid JSON instead.
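
The difference between the two return values can be checked in isolation. This is only a minimal sketch, assuming Python 3 (the code above is Python 2) and a made-up sample dictionary standing in for `json_result`; the `'abc123'` reference value is hypothetical.

```python
import json

# Hypothetical sample data standing in for json_result from the question.
# '\u0111' is the letter đ, written as an escape to keep this source ASCII-only.
json_result = {'suggestions': [{'name': 'Oslobo\u0111enja', 'reference': 'abc123'}]}

as_repr = str(json_result)         # Python literal syntax with single quotes
as_json = json.dumps(json_result)  # JSON text; non-ASCII is escaped as \uXXXX

# The str() form is rejected by a JSON parser.
try:
    json.loads(as_repr)
    repr_is_json = True
except ValueError:
    repr_is_json = False

print(repr_is_json)
print(as_json)

# The JSON form round-trips back to the original Unicode text.
assert json.loads(as_json) == json_result
```

A JSON parser on the receiving side should turn the `\u0111` escape back into the single character đ, so no manual decoding is needed there.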

Martijn Pieters
  • As you English-speaking people would say: works like a charm! :) Thanks, this is a far more elegant solution. I'm still learning Python. – Ognjen Stanić Sep 03 '13 at 15:42
1

When Python displays or string-escapes a UTF-8 encoded byte string, each non-ASCII byte is written as a \xVV value, where VV is the hex value of the byte. This is very different from Java's Unicode escapes, which use one \uVVVV per UTF-16 code unit, where VVVV is the hex value of that code unit (so a character such as đ is a single \u0111).

Consider:

\xc4\x91

In decimal, those hex values are:

196 145

then (in Java):

byte[] bytes = { (byte) 196, (byte) 145 };
System.out.println("result: " + new String(bytes, "UTF-8"));

prints:

result: đ
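
The same two steps (read the \xVV escapes as bytes, then decode those bytes as UTF-8) can be sketched for the whole string. This is a rough illustration in Python 3, not the Java client code; `decode_python_escapes` is a hypothetical helper name, and it assumes the input is ASCII text containing only \xVV-style escapes.

```python
def decode_python_escapes(escaped: str) -> str:
    """Decode backslash-x escapes that spell out UTF-8 bytes."""
    # Step 1: interpret the escapes; each \xVV becomes one code point 0-255.
    raw = escaped.encode('ascii').decode('unicode_escape')
    # Step 2: map those code points back to bytes, then decode them as UTF-8.
    return raw.encode('latin-1').decode('utf-8')


# The two bytes from the example above decode to the single character đ (U+0111).
assert bytes([196, 145]).decode('utf-8') == '\u0111'

# A raw string keeps the backslashes literal, as they arrive on the wire.
decoded = decode_python_escapes(r'Oslobo\xc4\x91enja')
assert decoded == 'Oslobo\u0111enja'
```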
Keith