String character encoding

Question

We developed a specific exporter for them which allows the position based product to provide a type of portfolio snapshot - both for equities and fixed income portfolios.

We developed a specific exporter for them which allows the position based product to provide a type of portfolio snapshot â€“ both for equities and fixed income portfolios.

The 1st text is what I copy from Jira, the second is what gets printed in Cognity. I get the text from Jira in a JSON format via the REST API and format it with a string builder and finally return a normal String as the output. All the symbols like " ' - etc. don't get printed right and I get a lot of â€“ in the output text. How can I fix that? I was thinking if there was some way I could change the encoding of the output String, maybe that might work?

EDIT: This is what I use to get the information from Jira after which I extract what I want from the JSON returned.

    String usercreds = "?os_username=user&os_password=password";
    try {
        url = new URL("http://jira/rest/api/2/issue/" + issuekey + usercreds);

        URLConnection urlConnection = url.openConnection();

        if (url.getUserInfo() != null) {
            String basicAuth = "Basic " + new String(new Base64().encode(url.getUserInfo().getBytes()));
            urlConnection.setRequestProperty("Authorization", basicAuth);
        }

        InputStream inputStream = urlConnection.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
        while ((s = reader.readLine()) != null) {
            temp.append(s);
            s = "";
        }
        issue = new JSONObject(temp.toString());
        temp.setLength(0);
    } catch (IOException e) {
        e.printStackTrace();
    } catch (JSONException e) {
        e.printStackTrace();
    }

If I understood correctly, there should be a way for me to specify somewhere in this code that I want the output to be ("application/json;charset=utf-8"), and that might solve my problem?

You can't change the encoding of a string - but you can affect the conversion between a string and bytes. Unfortunately it's not sufficiently clear where you're seeing this data and what else is going on to know how to help you. Please provide more context and diagnostic information. — Jon Skeet, Dec 11 '13 at 14:11
The original data is in the fields of a Jira issue, I use the REST API to get the whole issue information, which is returned to me as a JSON object. I then extract the wanted text from that JSON object and print it out in a Confluence page and there it doesn't show the said special characters. If this doesn't help, please ask me a more specific question so I can give you a better answer. — Schadenfreude, Dec 11 '13 at 14:16
Well the first thing to do is work out *where* it's getting broken. Log the exact characters as UTF-16 code units (and length of the string) at each stage, and that will help to pinpoint the issue. What encoding does Confluence use, and can you affect that? — Jon Skeet, Dec 11 '13 at 14:26
Ok, I'll have to do a bit more research. I didn't think the solution would be that hard. I'll get back to you when I get more time to work on this issue. — Schadenfreude, Dec 11 '13 at 14:41
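
The logging step suggested in the comments above could look roughly like the sketch below. `CodeUnitDump` and `dump` are hypothetical names chosen for illustration; they are not part of Jira, Confluence, or any library. The idea is to call it on the string at each stage (right after reading, after JSON parsing, just before output) and compare the results.

```java
final class CodeUnitDump {
    /** Returns the string's length followed by each UTF-16 code unit in hex. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        sb.append("length=").append(s.length()).append(':');
        for (int i = 0; i < s.length(); i++) {
            // charAt returns one UTF-16 code unit, not necessarily a full code point
            sb.append(String.format(" %04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An intact EN DASH is a single code unit, 2013.
        System.out.println(dump("\u2013"));
    }
}
```

If a stage that should contain one dash reports three code units instead, the damage happened at or before that stage.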

score 3 · Accepted Answer · answered Dec 11 '13 at 16:28

The dash in the JSON response is U+2013 (EN DASH). When encoded as UTF-8, it forms the byte sequence e2 80 93. This data is being decoded using the wrong encoding (most likely windows-1252), which turns those three bytes into three separate characters. Java's default I/O encoding is system-dependent.
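
The effect can be reproduced in isolation with a small sketch. `MojibakeDemo` and `decodeUtf8BytesAs` are hypothetical names, and windows-1252 is only the assumed culprit here; the actual default on the asker's machine is not known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes those bytes with another charset. */
    static String decodeUtf8BytesAs(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8); // EN DASH -> e2 80 93
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String broken = decodeUtf8BytesAs("\u2013", cp1252);
        // Each of the three bytes becomes its own character when misdecoded.
        System.out.println(broken.length());
        // Decoding with the matching charset round-trips the original dash.
        System.out.println(decodeUtf8BytesAs("\u2013", StandardCharsets.UTF_8).equals("\u2013"));
    }
}
```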

BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));

The above line is at fault. You must specify an encoding when transcoding using InputStreamReader.

For example:

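  // Reads the whole response body as UTF-8 text and appends it to out.
  public static void readUtf8(URLConnection connection, Appendable out)
      throws IOException {
    CharBuffer buffer = CharBuffer.allocate(1024);
    // Name the charset explicitly instead of relying on the platform default.
    try (InputStream in = connection.getInputStream();
         Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
      while (reader.read(buffer) != -1) {
        buffer.flip();   // switch the buffer from filling to draining
        out.append(buffer);
        buffer.clear();  // make the whole buffer writable again
      }
    }
  }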
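
If the server declares a charset in its Content-Type header (for example "application/json;charset=utf-8", as mentioned in the question), that value could be used instead of hard-coding UTF-8. The following is a minimal sketch under that assumption; `ContentTypeCharset` and `charsetOf` are hypothetical names, and the parsing is deliberately simple rather than a full header parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {
    /** Extracts the charset parameter from a Content-Type value, or returns fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the fallback
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/json;charset=utf-8", StandardCharsets.ISO_8859_1));
    }
}
```

It would be used as `new InputStreamReader(in, ContentTypeCharset.charsetOf(connection.getContentType(), StandardCharsets.UTF_8))`, so UTF-8 remains the fallback when the header carries no charset.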

Note: technically, JSON can be in any Unicode encoding (not just UTF-8); if you need to handle that, read this.
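
One known way to handle this is the heuristic from RFC 4627, section 3: because the first two characters of a JSON text are always ASCII, the pattern of zero bytes in the first four octets reveals the encoding. The sketch below illustrates that heuristic only; `JsonEncodingSniffer` and `detect` are hypothetical names, it assumes there is no byte order mark, and it falls back to UTF-8 when fewer than four bytes are available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class JsonEncodingSniffer {
    /** Guesses the Unicode encoding of a JSON text from its first four bytes. */
    static Charset detect(byte[] b) {
        if (b.length < 4) {
            return StandardCharsets.UTF_8; // assumption: too short to tell
        }
        boolean z0 = b[0] == 0, z1 = b[1] == 0, z2 = b[2] == 0, z3 = b[3] == 0;
        if (z0 && z1 && z2 && !z3) return Charset.forName("UTF-32BE"); // 00 00 00 xx
        if (!z0 && z1 && z2 && z3) return Charset.forName("UTF-32LE"); // xx 00 00 00
        if (z0 && !z1 && z2 && !z3) return StandardCharsets.UTF_16BE;  // 00 xx 00 xx
        if (!z0 && z1 && !z2 && z3) return StandardCharsets.UTF_16LE;  // xx 00 xx 00
        return StandardCharsets.UTF_8;                                 // xx xx xx xx
    }

    public static void main(String[] args) {
        byte[] sample = "{\"a\":1}".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(detect(sample));
    }
}
```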

Note 2: HttpURLConnection seems to have improved since Java 5, but I would make sure it does automatic length handling (reading the Content-Length header, handling chunked encoding, etc.).

Changed `BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));` to `BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));` and everything showed up fine. — Schadenfreude, Dec 11 '13 at 16:41
