I am having weird character encoding issues with a JSON array that is grabbed from a web page. The server is sending back this header:

Content-Type: text/javascript; charset=UTF-8

I can also view the JSON output in Firefox or any other browser, and the Unicode characters display properly. The response sometimes contains words from other languages, with accented characters and the like. However, when I download it and put it into a String in Java, I get strange question marks in place of those characters. Here is my code:

HttpParams params = new BasicHttpParams();
HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1);
HttpProtocolParams.setContentCharset(params, "utf-8");
params.setBooleanParameter("http.protocol.expect-continue", false);

HttpClient httpclient = new DefaultHttpClient(params);

HttpGet httpget = new HttpGet("http://www.example.com/json_array.php");
HttpResponse response;
try {
    response = httpclient.execute(httpget);

    if (response.getStatusLine().getStatusCode() == 200) {
        // Connection was established. Get the content.

        HttpEntity entity = response.getEntity();
        // If the response does not enclose an entity, there is no need
        // to worry about connection release

        if (entity != null) {
            // A Simple JSON Response Read
            InputStream instream = entity.getContent();
            String jsonText = convertStreamToString(instream);

            Toast.makeText(getApplicationContext(), "Response: " + jsonText, Toast.LENGTH_LONG).show();
        }
    }
} catch (MalformedURLException e) {
    Toast.makeText(getApplicationContext(), "ERROR: Malformed URL - " + e.getMessage(), Toast.LENGTH_LONG).show();
    e.printStackTrace();
} catch (IOException e) {
    Toast.makeText(getApplicationContext(), "ERROR: IO Exception - " + e.getMessage(), Toast.LENGTH_LONG).show();
    e.printStackTrace();
} catch (JSONException e) {
    Toast.makeText(getApplicationContext(), "ERROR: JSON - " + e.getMessage(), Toast.LENGTH_LONG).show();
    e.printStackTrace();
}

private static String convertStreamToString(InputStream is) {
    /*
     * To convert the InputStream to a String we use the BufferedReader.readLine()
     * method. We iterate until the BufferedReader returns null, which means
     * there is no more data to read. Each line is appended to a StringBuilder
     * and the result is returned as a String.
     */
    BufferedReader reader;
    try {
        reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    } catch (UnsupportedEncodingException e1) {
        // Every Java platform must support UTF-8, so this should never happen.
        // Rethrowing also lets the compiler see that reader is always assigned.
        throw new IllegalStateException(e1);
    }
    StringBuilder sb = new StringBuilder();

    String line;
    try {
        while ((line = reader.readLine()) != null) {
            sb.append(line + "\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            is.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return sb.toString();
}

As you can see, I am specifying UTF-8 on the InputStreamReader, but every time I view the returned JSON text via a Toast it contains strange question marks. Do I need to read the InputStream into a byte[] instead?

Thanks in advance for any help.

Michael Taggart

5 Answers

Try this:

if (entity != null) {
    // A Simple JSON Response Read
    // InputStream instream = entity.getContent();
    // String jsonText = convertStreamToString(instream);

    String jsonText = EntityUtils.toString(entity, HTTP.UTF_8);

    // ... toast code here
}
Vit Khudenko
  • Thanks for the response. I added your changes and imported the extra Apache stuff for EntityUtils but now the app just terminates unexpectedly on the EntityUtils.toString line. program compiles and runs, but do I need to do something to entity before calling toString? – Michael Taggart Dec 18 '10 at 22:42
  • never mind. I was an idiot and messed up something with my url. It works! The characters are rendered correctly! – Michael Taggart Dec 18 '10 at 22:47
  • @Michael: This answer is very good and I'd accept this one if I'd asked the question. – Ken Mar 24 '12 at 20:25
  • @SK9 Thanks for the reminder. I completely forgot to click the check mark. My apologies Arhimed. – Michael Taggart Mar 26 '12 at 18:47
  • And for posting `UrlEncodedFormEntity encodedFormEntity = new UrlEncodedFormEntity(nameValuePairs, HTTP.UTF_8); post.setEntity(encodedFormEntity);` :) – Muhammad Babar Aug 17 '14 at 12:47
  • you saved my time!! – Sjd Jan 24 '17 at 14:51
  • It also works with `StringEntity`, e.g. `StringEntity se = new StringEntity(params[0], HTTP.UTF_8);` – H.Sarxha Aug 16 '21 at 08:46

@Arhimed's answer is the solution. But I cannot see anything obviously wrong with your convertStreamToString code.

My guesses are:

  1. The server is putting a UTF Byte Order Mark (BOM) at the start of the stream. The standard Java UTF-8 character decoder does not remove the BOM, so the chances are that it would end up in the resulting String. (However, the code for EntityUtils doesn't seem to do anything with BOMs either.)
  2. Your convertStreamToString is reading the character stream a line at a time, and reassembling it using a hard-wired '\n' as the end-of-line marker. If you are going to write that to an external file or application, you should probably be using a platform-specific end-of-line marker.
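
To make both guesses concrete, here is a minimal sketch using only standard Java classes. It is not the code from the question: the class name BomAwareReader and the method readUtf8 are made up for illustration, and StandardCharsets assumes Java 7 or later (or a reasonably recent Android API level). It reads with a char buffer instead of readLine(), so the original line endings are kept, and it drops a single leading U+FEFF if the decoder left one in the text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class BomAwareReader {

    /** Reads the whole stream as UTF-8, keeps line endings, drops a leading BOM. */
    static String readUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            // Copy characters as they come, so "\r\n" and "\n" are left untouched.
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        // The UTF-8 decoder passes a BOM through as U+FEFF; remove it if present.
        if (sb.length() > 0 && sb.charAt(0) == '\uFEFF') {
            sb.deleteCharAt(0);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulated response body: BOM, a small JSON array, Windows line ending.
        byte[] body = "\uFEFF[\"caf\u00e9\"]\r\n".getBytes(StandardCharsets.UTF_8);
        String text = readUtf8(new ByteArrayInputStream(body));
        System.out.println(text.startsWith("\uFEFF")); // prints false
        System.out.println(text.length());             // prints 10
    }
}
```

If the text is later written to a file, System.lineSeparator() gives the platform-specific end-of-line marker.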
Stephen C

The issue is simply that your convertStreamToString does not honor the encoding set in the HttpResponse. If you look inside EntityUtils.toString(entity, HTTP.UTF_8), you will see that EntityUtils first checks whether an encoding is set in the HttpResponse and, if there is one, uses that encoding. It only falls back to the encoding passed as the parameter (in this case HTTP.UTF_8) if no encoding is set in the entity.

In other words, the HTTP.UTF_8 you pass as the parameter is never used when the response declares its own encoding, whereas your convertStreamToString always decodes as UTF-8, even when that is the wrong encoding. Here is an update to your code that uses the helper method from EntityUtils.

HttpEntity entity = response.getEntity();
String charset = getContentCharSet(entity);
InputStream instream = entity.getContent();
String jsonText = convertStreamToString(instream, charset);

private static String getContentCharSet(final HttpEntity entity) throws ParseException {
    if (entity == null) {
        throw new IllegalArgumentException("HTTP entity may not be null");
    }
    String charset = null;
    if (entity.getContentType() != null) {
        HeaderElement values[] = entity.getContentType().getElements();
        if (values.length > 0) {
            NameValuePair param = values[0].getParameterByName("charset");
            if (param != null) {
                charset = param.getValue();
            }
        }
    }
    return TextUtils.isEmpty(charset) ? HTTP.UTF_8 : charset;
}



private static String convertStreamToString(InputStream is, String encoding) {
    /*
     * To convert the InputStream to a String we use the
     * BufferedReader.readLine() method. We iterate until the BufferedReader
     * returns null, which means there is no more data to read. Each line is
     * appended to a StringBuilder and the result is returned as a String.
     */
    BufferedReader reader;
    try {
        reader = new BufferedReader(new InputStreamReader(is, encoding));
    } catch (UnsupportedEncodingException e1) {
        // The charset name comes from the server, so it may be unknown here.
        // Failing fast also lets the compiler see that reader is always assigned.
        throw new IllegalArgumentException("Unsupported encoding: " + encoding, e1);
    }
    StringBuilder sb = new StringBuilder();

    String line;
    try {
        while ((line = reader.readLine()) != null) {
            sb.append(line + "\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            is.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return sb.toString();
}
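
One caveat with the code above: the charset name now comes from the server, so it may be misspelled or simply not supported on the device. A possible way to guard against that, sketched with standard Java classes only, is to validate the name first and fall back to a default. The class name CharsetResolver and the method resolve are hypothetical, not part of HttpClient or Android.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, for illustration only.
class CharsetResolver {

    /** Returns the named charset, or the fallback if the name is missing or unusable. */
    static Charset resolve(String name, Charset fallback) {
        if (name == null || name.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Malformed or unknown name sent by the server: use the fallback.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("utf-8", StandardCharsets.ISO_8859_1));      // prints UTF-8
        System.out.println(resolve("no-such-charset", StandardCharsets.UTF_8)); // prints UTF-8
        System.out.println(resolve(null, StandardCharsets.UTF_8));              // prints UTF-8
    }
}
```

The resulting Charset can be passed to new InputStreamReader(is, charset), which, unlike the String overload, does not throw UnsupportedEncodingException.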
Win Myo Htet

Arhimed's answer is correct. However, the same thing can be done simply by sending an additional header in the HTTP request:

Accept-Charset: utf-8

No need to remove anything or use any other library.

For example,

GET / HTTP/1.1
Host: www.website.com
Connection: close
Accept: text/html
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.10 Safari/537.36
DNT: 1
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: utf-8

Most probably your request doesn't have any Accept-Charset header.
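
As a rough sketch of how the header could be added from Java, here is a version that uses the standard HttpURLConnection rather than the Apache client from the question (with the Apache client, httpget.setHeader("Accept-Charset", "utf-8") should be the equivalent call). The class name AcceptCharsetExample and the method prepare are made up for illustration. Keep in mind that a server is free to ignore this header, so the charset in the response's Content-Type should still be checked.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, for illustration only.
class AcceptCharsetExample {

    /** Prepares (but does not open) a GET request that asks for UTF-8. */
    static HttpURLConnection prepare(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        // Must be set before the connection is actually opened.
        conn.setRequestProperty("Accept-Charset", "utf-8");
        return conn;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = prepare("http://www.example.com/json_array.php");
        // No network traffic has happened yet; this only inspects the request.
        System.out.println(conn.getRequestProperty("Accept-Charset")); // prints utf-8
    }
}
```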

Alan Deep

Extract the charset from the response content type field. You can use the following method to do this:

private static String extractCharsetFromContentType(String contentType) {
    if (TextUtils.isEmpty(contentType)) return null;

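    // Match the charset parameter case-insensitively and stop at whitespace,
    // ';' or ','. Surrounding double quotes, if present, are not captured.
    Pattern p = Pattern.compile("charset=\"?([^\\s;,\"]+)", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(contentType);

    if (m.find()) {
        return m.group(1);
    }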

    return null;
}

Then use the extracted charset to create the InputStreamReader:

String charsetName = extractCharsetFromContentType(connection.getContentType());

InputStreamReader inReader = TextUtils.isEmpty(charsetName)
        ? new InputStreamReader(inputStream)
        : new InputStreamReader(inputStream, charsetName);
BufferedReader reader = new BufferedReader(inReader);
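
For trying the idea outside Android, the two steps can be sketched with standard Java classes only. This is an illustration under a few assumptions, not a drop-in replacement: the class name ContentTypeDecoding and its methods are hypothetical, the body is assumed to be already in memory as a byte[], and UTF-8 is used as the fallback instead of the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ContentTypeDecoding {

    // Captures the charset parameter value, without optional double quotes.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;,\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name from a Content-Type value, or null if absent. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Decodes the body with the declared charset, falling back to UTF-8. */
    static String decode(byte[] body, String contentType) {
        Charset cs = StandardCharsets.UTF_8;
        String name = charsetOf(contentType);
        if (name != null) {
            try {
                cs = Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the UTF-8 fallback.
            }
        }
        return new String(body, cs);
    }

    public static void main(String[] args) {
        byte[] latin1Body = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(charsetOf("text/javascript; charset=UTF-8"));
        System.out.println(decode(latin1Body, "text/javascript; charset=ISO-8859-1"));
    }
}
```

Decoding the same ISO-8859-1 bytes as UTF-8 would produce a replacement character instead of the accented letter, which is one common source of question marks in the decoded text.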
brodoll