
I have the following RESTful client in Python:

import requests
s = 'وإليك ما يقوله إثنان من هؤلاء'
resp = requests.post('http://localhost:8080/MyApp/webresources/production/sendSentence', json={'sentence': s})

The code above calls a web service implemented in Java, which returns the same sentence that the requests client sent.

This is the Java web service:

@POST
@Consumes("application/json")
@Produces("text/html; charset=UTF-8")
@Path("/sendSentence")
public String sendSentence(@Context HttpServletRequest requestContext, String valentryJson) throws Exception {
    try {
        if (valentryJson != null) {
            JSONObject jsonObject;
            jsonObject = new JSONObject(valentryJson);
            String sentence = jsonObject.getString("sentence");

            return sentence;
        }
    } catch (JSONException ex) {
        // Malformed JSON is ignored here; the method falls through and returns "".
    }
    return "";
}

The problem is the encoding: when I inspect the content, this is the result:

>>> resp.content

'\xd9\x88\xd8\xa5\xd9\x84\xd9\x8a\xd9\x83 \xd9\x85\xd8\xa7 \xd9\x8a\xd9\x82\xd9\x88\xd9\x84\xd9\x87 \xd8\xa5\xd8\xab\xd9\x86\xd8\xa7\xd9\x86 \xd9\x85\xd9\x86 \xd9\x87\xd8\xa4\xd9\x84\xd8\xa7\xd8\xa1'

Or when I use print:

>>> print resp.content

    ظˆط¥ظ„ظٹظƒ ظ…ط§ ظٹظ‚ظˆظ„ظ‡ ط¥ط«ظ†ط§ظ† ظ…ظ† ظ‡ط¤ظ„ط§ط،
Martijn Pieters
Riadh Belkebir
  • You didn't decode the content. Presumably this is Python 2? – Martijn Pieters Nov 24 '16 at 13:33
  • I tried to decode it using resp.content.decode('utf8'), but it does not work. – Riadh Belkebir Nov 24 '16 at 13:34
  • Yes, it is Python 2.6.5. – Riadh Belkebir Nov 24 '16 at 13:35
  • **How** does that not work? Do you get a `UnicodeEncodeError` when you print the result to the console? If so, then your console can't handle your text (Python has to encode to print, and your console is not configured to receive Arabic text). – Martijn Pieters Nov 24 '16 at 13:36
  • Note that your first example would *fail* in Python 2.6, unless you used a hack to change the built-in default codec used for implicit decoding and encoding. Don't do that; that hack was disabled for a reason. – Martijn Pieters Nov 24 '16 at 13:38
  • I am sorry for the mistake; resp.content.decode('utf8') works now. I think I had tried something else, which is why I thought it did not work. Thanks again, @MartijnPieters. – Riadh Belkebir Nov 24 '16 at 13:51

1 Answer


Your Java webservice produces HTML, UTF-8 encoded:

@Produces("text/html; charset=UTF-8")

but you took the raw bytes returned without decoding:

>>> resp.content

resp.content gives you bytes, not Unicode text. You could use the resp.text attribute instead, which uses the charset parameter of the Content-Type header to decode your data:

>>> resp.text
u'\u0648\u0625\u0644\u064a\u0643 \u0645\u0627 \u064a\u0642\u0648\u0644\u0647 \u0625\u062b\u0646\u0627\u0646 \u0645\u0646 \u0647\u0624\u0644\u0627\u0621'
>>> print resp.text
وإليك ما يقوله إثنان من هؤلاء
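
Decoding the bytes yourself gives the same text, as the comments under the question also conclude. The following is a minimal sketch using the exact bytes printed in the question; the guess that the garbled console output comes from reading those UTF-8 bytes as Windows-1256 (cp1256) is an assumption based on how the output looks, not something the question states. It is written to run on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The exact bytes shown for resp.content in the question.
raw = (b'\xd9\x88\xd8\xa5\xd9\x84\xd9\x8a\xd9\x83 \xd9\x85\xd8\xa7 '
       b'\xd9\x8a\xd9\x82\xd9\x88\xd9\x84\xd9\x87 '
       b'\xd8\xa5\xd8\xab\xd9\x86\xd8\xa7\xd9\x86 \xd9\x85\xd9\x86 '
       b'\xd9\x87\xd8\xa4\xd9\x84\xd8\xa7\xd8\xa1')

# Decoding with the codec the service declared yields Unicode text.
text = raw.decode('utf-8')

# Assumption: a console that interprets the same bytes as cp1256
# would display mojibake instead of the Arabic sentence.
garbled = raw.decode('cp1256')

# repr() keeps the output ASCII-safe on consoles that cannot show Arabic.
print(repr(text))
print(len(raw), len(text))
```

Printing `text` directly only works if the console encoding can represent Arabic; otherwise Python raises a `UnicodeEncodeError`, as noted in the comments.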

Be careful, however: if no charset parameter is present but the Content-Type header indicates a text/... content type (such as text/html), requests follows the HTTP RFCs and decodes the data as Latin-1. This never raises an error, because Latin-1 maps every byte to a character, but it may not be the correct codec. For HTML data, use an HTML parser instead: pass in the bytestring and leave it to the parser to work out which codec is correct (HTML often records the right encoding in a <meta> tag). See "Retrieve links from web page using Python and BeautifulSoup".
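
The Latin-1 pitfall can be illustrated without a running server. The sketch below is not how requests is implemented; `charset_from_content_type` and `decode_body` are hypothetical helpers, and falling back to UTF-8 is an assumption that only makes sense when you know what the service sends.

```python
def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    # Parameters follow the media type, separated by semicolons.
    for param in header_value.split(';')[1:]:
        name, _, value = param.strip().partition('=')
        if name.strip().lower() == 'charset':
            return value.strip().strip('"\'')
    return None


def decode_body(body, header_value, fallback='utf-8'):
    """Decode bytes using the declared charset, else a chosen fallback."""
    charset = charset_from_content_type(header_value)
    return body.decode(charset or fallback)


original = u'\u0648\u0625\u0644\u064a\u0643'   # first word of the sentence
body = original.encode('utf-8')

with_charset = decode_body(body, 'text/html; charset=UTF-8')
without_charset = decode_body(body, 'text/html')

# Decoding as Latin-1 succeeds, but produces one character per byte
# rather than the original text.
latin1_guess = body.decode('latin-1')
```

With requests itself, assigning `resp.encoding = 'utf-8'` before reading `resp.text` overrides the codec that would otherwise be used.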

Martijn Pieters