
I'm trying to make a simple call to the OpenCalais API to tag entities in a raw document written in French (so it contains many accented characters). In the returned response, all accented characters are replaced by strange symbols.

I have already read the API documentation, set the "Content-Type" header to "text/raw; charset=utf-8", and checked that the text really is encoded in UTF-8.

This is the code I use to read content from a file:

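import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Read the whole file as UTF-8, keeping a line break after each line
public static String readInput(String filename) {
    Path file = Paths.get(filename);
    StringBuilder content = new StringBuilder();
    String line;
    try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
        while ((line = reader.readLine()) != null) {
            content.append(line).append('\n');
        }
    } catch (IOException x) {
        System.err.format("IOException: %s%n", x);
    }

    return content.toString();
}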

Before sending the request, I printed out the string read from the file. It showed the original text with no encoding errors.
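
A console printout only shows that the string and the terminal agree on an encoding, so a stricter check is to look at the bytes themselves. The sketch below is a minimal illustration of that idea; the class `Utf8Hex` and the method `toHex` are hypothetical names, not part of my code. It dumps the UTF-8 bytes of a string as hex so that an accented character such as "é" can be confirmed to be the two bytes `c3 a9`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the original code.
class Utf8Hex {

    // Return the UTF-8 bytes of a string as space-separated hex pairs.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u00e9t\u00e9" is the word "été", written with escapes so the
        // result does not depend on the source file encoding.
        System.out.println(toHex("\u00e9t\u00e9")); // prints: c3 a9 74 c3 a9
    }
}
```

Calling `Utf8Hex.toHex(readInput(filename))` on a short test file would show whether the accented characters were decoded correctly on the way in.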

Here is the code I use to make the request and fetch the response from the OpenCalais API:

    // make call to the API link
    DefaultHttpClient httpClient = new DefaultHttpClient();
    HttpPost postRq = new HttpPost(url);

    // add necessary headers (custom)
    postRq.addHeader("x-ag-access-token", tokenKey);
    postRq.addHeader("x-calais-language", lang);
    postRq.addHeader("outputFormat", outputFormat);

    // add necessary headers (fixed)
    postRq.addHeader("Content-Type", "text/raw;charset=utf-8");
    postRq.addHeader("x-calais-contentClass", "news");
    postRq.addHeader("Accepted-Charset", "utf-8");

    // pass body content in the call
    StringEntity entityInput = null;
    try {
        entityInput = new StringEntity(text);
        postRq.setEntity(entityInput);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

    // execute the call
    HttpResponse response = null;
    try {
        response = httpClient.execute(postRq);
    } catch (IOException e) {
        e.printStackTrace();
    }

    if (response.getStatusLine().getStatusCode() != 200) {
        throw new RuntimeException("Failed : HTTP error code : "
                + response.getStatusLine().getStatusCode());
    }

    // read the response
    String output, result = "";
    BufferedReader br = null;

    try {
        br = new BufferedReader(
                new InputStreamReader(response.getEntity().getContent(), "UTF-8"));
        while ((output = br.readLine()) != null) {
            result += output + "\n";
            System.out.println(output); // !!! the returned text has strange symbols
        }
        br.close();
    } catch (UnsupportedOperationException | IOException e) {
        e.printStackTrace();
    }

    // close the connection
    httpClient.getConnectionManager().shutdown();
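
One thing worth checking in the request code above, assuming Apache HttpClient 4.x: as far as I can tell, the single-argument `new StringEntity(text)` constructor encodes the body as ISO-8859-1 by default, while the "Content-Type" header announces UTF-8. If that is what happens here, the server would decode Latin-1 bytes as UTF-8 and produce replacement symbols. The standalone sketch below reproduces that kind of mismatch with the JDK only; `CharsetMismatchDemo` and `roundTrip` are hypothetical names used for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only; not part of the original code.
class CharsetMismatchDemo {

    // Encode text with one charset and decode the resulting bytes with another,
    // mimicking a client and a server that disagree about the body encoding.
    static String roundTrip(String text, Charset sentAs, Charset readAs) {
        byte[] wire = text.getBytes(sentAs);
        return new String(wire, readAs);
    }

    public static void main(String[] args) {
        String text = "d\u00e9j\u00e0 vu"; // "déjà vu", written with escapes

        // Body sent as ISO-8859-1 but read as UTF-8: accents are lost.
        String broken = roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // Body sent and read as UTF-8: the text survives.
        String intact = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println("mismatched charsets preserved text: " + broken.equals(text));
        System.out.println("matching charsets preserved text:   " + intact.equals(text));
    }
}
```

If this is indeed the cause, passing the charset explicitly, for example `new StringEntity(text, "UTF-8")`, should make the body match the declared "Content-Type". Separately, the standard request header is named "Accept-Charset"; a header called "Accepted-Charset" is probably ignored by the server.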

Here are several things that I have tried, without success:

  • Retyping the entire text by hand (no copy-paste).
  • Copying the text into Sublime Text, deleting and retyping every accented character (to rule out an encoding conflict introduced by copy-paste), and saving the file as UTF-8.

Could you please tell me how to fix this? Thanks!

PS: I posted my question on the OpenCalais discussion forum on their website but haven't received a solution yet.

Huong Le