5

I am trying to get a (JSON formatted) String from a URL and consume it as a JSON object. I lose the UTF-8 encoding when I convert the String to a JSONObject.

This is the function I use to connect to the URL and get the string:

private static String getUrlContents(String theUrl) {
    StringBuilder content = new StringBuilder();
    try {
        URL url = new URL(theUrl);
        URLConnection urlConnection = url.openConnection();
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));

        String line;
        while ((line = bufferedReader.readLine()) != null) {
            content.append(line + "\n");
        }
        bufferedReader.close();
    } catch(Exception e) {
        e.printStackTrace();
    }

    return content.toString();
}

When I get data from the server, the following code displays the correct characters:

String output = getUrlContents(url);
Log.i("message1", output);

But when I convert the output string to a JSONObject, the Persian characters become question marks, like this: ??????. (messages is the name of the array in the JSON.)

JSONObject reader = new JSONObject(output);
String messages = new String(reader.getString("messages").getBytes("ISO-8859-1"), "UTF-8");
Log.i("message2", messages);
Ziem
Ali Sheikhpour

4 Answers

6

You're telling Java to convert the string (with key messages) to bytes using ISO-8859-1 and then to create a new String from these bytes, interpreted as UTF-8.

new String(reader.getString("messages").getBytes("ISO-8859-1"), "UTF-8");

You could simply use:

String messages = reader.getString("messages");
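To see why the round trip produces question marks, here is a minimal sketch that runs the same conversion on a Persian word. The class and method names (RoundTripDemo, brokenRoundTrip) are made up for illustration and are not part of the question's code.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Hypothetical helper: repeats the conversion from the question on an
    // already correctly decoded String.
    static String brokenRoundTrip(String s) {
        // ISO-8859-1 cannot represent Persian letters, so each one is
        // replaced with the byte for '?' while encoding.
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        // Decoding those bytes as UTF-8 cannot bring the lost letters back.
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Persian "salam", written with Unicode escapes to keep the source ASCII-only
        String persian = "\u0633\u0644\u0627\u0645";
        System.out.println(brokenRoundTrip(persian)); // prints ????
        System.out.println(brokenRoundTrip("hello")); // prints hello
    }
}
```

Plain ASCII text survives the round trip, which is why the problem only shows up with non-ASCII characters.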
toKrause
  • This works because the bytes you're receiving over the wire are already interpreted correctly in `getUrlContents` and are internally stored as a UTF-16 string. – toKrause Jan 11 '16 at 09:42
  • `getUrlContents` only works when the server's character encoding matches the client's. – Alastair McCormack Jan 15 '16 at 21:09
1

You can update your code as follows:

    private static String getUrlContents(String theUrl) {
        StringBuilder content = new StringBuilder();
        try {
            URL url = new URL(theUrl);
            URLConnection urlConnection = url.openConnection();
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "utf-8"));

            String line;
            while ((line = bufferedReader.readLine()) != null) {
                content.append(line).append("\n");
            }
            bufferedReader.close();
        } catch(Exception e) {
            e.printStackTrace();
        }

        return content.toString().trim();
    }
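If it helps, the decoding step can be separated from the network call so it can be checked against known bytes. This is only a sketch: StreamDecoding and readAll are hypothetical names, and it assumes Java 7+ (try-with-resources, StandardCharsets).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamDecoding {
    // Hypothetical helper: reads the whole stream, decoding its bytes with
    // the charset passed in instead of the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader (and the stream) even on failure
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the bytes a server would send: JSON encoded as UTF-8
        String json = "{\"messages\":\"\u0633\u0644\u0627\u0645\"}";
        byte[] wire = json.getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        System.out.println(decoded.equals(json));
    }
}
```

In getUrlContents you would pass urlConnection.getInputStream() as the stream.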
BNK
1

You've got two encoding issues:

  1. The server sends text encoded in a particular character set. When you set up your InputStreamReader, you need to pass the encoding the server used so the bytes can be decoded properly. The character encoding is usually given in the charset parameter of the Content-Type HTTP response header. JSON is typically UTF-8 encoded, but can also legally be UTF-16 or UTF-32, so you need to check. Without a specified encoding, your platform's default charset will be used when converting bytes to Strings, and vice versa. Basically, you should always specify the charset.

  2. String messages = new String(reader.getString("messages").getBytes("ISO-8859-1"), "UTF-8"); will cause problems if the text contains non-ASCII characters: it encodes the string to ISO-8859-1 and then tries to decode the result as UTF-8. Characters that ISO-8859-1 cannot represent, such as Persian letters, are replaced with ? during the encoding step.
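The first point can be illustrated with a small sketch that decodes the same bytes with two different charsets. DecodeComparison and decodeAs are names invented for this example; the byte array stands in for whatever the server actually sends.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeComparison {
    // Hypothetical helper: decodes raw bytes with an explicit charset.
    static String decodeAs(byte[] wire, Charset charset) {
        return new String(wire, charset);
    }

    public static void main(String[] args) {
        // Persian "salam" (four letters), encoded the way a UTF-8 server would send it
        String original = "\u0633\u0644\u0627\u0645";
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        String right = decodeAs(wire, StandardCharsets.UTF_8);
        String wrong = decodeAs(wire, StandardCharsets.ISO_8859_1);

        System.out.println(wire.length);            // prints 8: two bytes per letter
        System.out.println(right.equals(original)); // prints true
        System.out.println(wrong.length());         // prints 8: one wrong char per byte
    }
}
```

Decoding with the wrong charset does not fail; it silently yields different characters, which is why relying on the platform default is risky.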

A simple regex pattern can be used to extract the charset value from the Content-Type header before reading the InputStream. I've also included a neat InputStream -> String converter.

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static String getUrlContents(String theUrl) {

    try {
        URL url = new URL(theUrl);
        URLConnection urlConnection = url.openConnection();

        // Get charset field from Content-Type header; the header may be absent (null)
        String contentType = urlConnection.getContentType();
        // matches value in key / value pair, with or without surrounding quotes
        Pattern encodingPattern = Pattern.compile(".*charset\\s*=\\s*\"?([\\w-]+).*", Pattern.CASE_INSENSITIVE);
        Matcher encodingMatcher = encodingPattern.matcher(contentType == null ? "" : contentType);
        // set charsetString to match value if charset is given, else default to UTF-8
        String charsetString = encodingMatcher.matches() ? encodingMatcher.group(1) : "UTF-8";

        // Quick way to read from InputStream.
        // \A is a boundary match for beginning of the input
        try (InputStream is = urlConnection.getInputStream();
             Scanner scanner = new Scanner(is, charsetString).useDelimiter("\\A")) {
            // hasNext() is false for an empty body, where next() would throw
            return scanner.hasNext() ? scanner.next() : "";
        }
    } catch(Exception e) {
        e.printStackTrace();
    }

    return null;
}
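The charset lookup can also be pulled out into a pure function, which makes the fallback cases easy to check without a network connection. This is a sketch rather than a full Content-Type parser: ContentTypeCharset and charsetOf are hypothetical names, and falling back to UTF-8 is my assumption for JSON responses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {
    // Matches charset=value, with or without quotes, in any letter case
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([\\w.:-]+)\"?", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or UTF-8 when the header is missing, has no charset, or names an
    // unknown charset.
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/json; charset=utf-8"));
        System.out.println(charsetOf("text/html; charset=\"ISO-8859-1\""));
        System.out.println(charsetOf(null));
    }
}
```

The returned Charset can be passed straight to new Scanner(is, charset.name()) or to an InputStreamReader.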
Alastair McCormack
0

Not sure if this will help, but you might be able to do something like this:

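JSONObject result = null;
try
{
    // Assumes output holds the raw response bytes (byte[]); String has no
    // (String, String) constructor, so this does not compile if output is a String.
    String str = new String(output, "UTF-8");
    result = (JSONObject) new JSONTokener(str).nextValue();
}
catch (Exception e)
{
    e.printStackTrace();
}

// result stays null if decoding or parsing failed
String messages = (result != null) ? result.getString("messages") : null;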
jt-gilkeson