I'm trying to make a simple call to the OpenCalais API to tag entities in a raw document written in French (so it contains many accented characters). In the returned response, all accented characters are replaced by strange symbols.
I have already read the API documentation, set the "Content-Type" header to "text/raw; charset=utf-8", and verified that the text is encoded in UTF-8.
This is the code I use to read content from a file:
public static String readInput(String filename) {
    Path file = Paths.get(filename);
    Charset charset = Charset.forName("UTF-8");
    String line, content = "";
    try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
        while ((line = reader.readLine()) != null) {
            content += line;
        }
    } catch (IOException x) {
        System.err.format("IOException: %s%n", x);
    }
    return content;
}
Before sending the request, I printed out the string read from the file. It showed the original text with no encoding errors.
Here is the code I use to make the request and fetch the response from the OpenCalais API:
// make call to the API link
DefaultHttpClient httpClient = new DefaultHttpClient();
HttpPost postRq = new HttpPost(url);
// add necessary headers (custom)
postRq.addHeader("x-ag-access-token", tokenKey);
postRq.addHeader("x-calais-language", lang);
postRq.addHeader("outputFormat", outputFormat);
// add necessary headers (fixed)
postRq.addHeader("Content-Type", "text/raw;charset=utf-8");
postRq.addHeader("x-calais-contentClass", "news");
postRq.addHeader("Accepted-Charset", "utf-8");
// pass body content in the call
StringEntity entityInput = null;
try {
    entityInput = new StringEntity(text);
    postRq.setEntity(entityInput);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
// execute the call
HttpResponse response = null;
try {
    response = httpClient.execute(postRq);
} catch (IOException e) {
    e.printStackTrace();
}
if (response.getStatusLine().getStatusCode() != 200) {
    throw new RuntimeException("Failed : HTTP error code : "
            + response.getStatusLine().getStatusCode());
}
// read the response
String output, result = "";
BufferedReader br = null;
try {
    br = new BufferedReader(
            new InputStreamReader(response.getEntity().getContent(), "UTF-8"));
    while ((output = br.readLine()) != null) {
        result += output + "\n";
        System.out.println(output); // !!! the returned text has strange symbols
    }
    br.close();
} catch (UnsupportedOperationException | IOException e) {
    e.printStackTrace();
}
// close the connection
httpClient.getConnectionManager().shutdown();
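To help narrow this down, here is a minimal sketch (standard library only, not my actual request code) of what happens to accented characters when the sender encodes the body with one charset and the receiver decodes it with another. I am assuming this could be relevant because, as far as I know, `new StringEntity(String)` in Apache HttpClient 4.x encodes with ISO-8859-1 when no charset is passed, regardless of the "Content-Type" header. The class name `CharsetMismatchDemo`, the helper `roundTrip`, and the sample string are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Encode the text with one charset, then decode the bytes with another,
    // mimicking a client and a server that disagree on the encoding.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // French sample text written with Unicode escapes so the source file
        // encoding does not matter (hypothetical sample, not my real document).
        String sample = "\u00C9l\u00E8ve \u00E0 l'\u00E9cole";

        // Mismatch: ISO-8859-1 bytes read back as UTF-8; the accented
        // characters are not valid UTF-8 sequences and get replaced.
        String garbled = roundTrip(sample, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("mismatch preserved text: " + garbled.equals(sample));

        // Match: UTF-8 on both sides preserves the text.
        String intact = roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("match preserved text: " + intact.equals(sample));
    }
}
```

If that assumption holds, passing the charset explicitly when building the entity (for example `new StringEntity(text, "UTF-8")`) would be the first thing to check, but I have not confirmed that this is what the OpenCalais endpoint expects.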
Here are several things I have tried, without success:
- Retyping the entire text (without copy-paste).
- Copying the text into Sublime Text, retyping every accented character (deleting and re-entering each one to rule out an encoding conflict introduced by copy-paste), and saving the file as UTF-8.
Could you please tell me how to fix this? Thanks!
PS: I also posted this question on the OpenCalais discussion forum on their website but haven't received a solution yet.