I'm facing a character-encoding issue on Linux. I'm retrieving content from Amazon S3 that was saved using UTF-8 encoding. The content is in Chinese, and it displays correctly in the browser.

I'm using the Amazon SDK to retrieve the content and update it. Here's the code I'm using:


StringBuilder builder = new StringBuilder();
S3Object object = client.getObject(new GetObjectRequest(bucketName, key));
// Decode the S3 object's bytes explicitly as UTF-8.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(object.getObjectContent(), "utf-8"));
while (true) {
    String line = reader.readLine();
    if (line == null)
        break;
    // Note: readLine() strips the line terminator, so line breaks are not kept.
    builder.append(line);
}

This code works fine in a Windows environment: I was able to update the content and save it back without corrupting any of the Chinese characters.

But it behaves differently in a Linux environment. The code is unable to translate the characters properly; the Chinese characters are rendered as ???

I'm not sure what's going wrong here. Any pointers will be appreciated.

-Thanks

Shamik
  • When you say the characters are *rendered* as ???, where are you seeing these rendered? Perhaps the data is fine but you're trying to display them in an environment that doesn't support Unicode or in a font that doesn't have the proper glyphs. – Jacob May 13 '11 at 00:29
  • That code looks fine. It's probably your terminal that needs to be in UTF-8 mode to display the characters, or you're outputting the wrong encoding, probably using the platform default encoding which might not be UTF-8. Show us the code you use to output the characters, and tell us what terminal you're using. – Christoffer Hammarström May 13 '11 at 00:29
  • When you say the characters are not showing up properly, are you outputting them to a console? If so, what type of console? – onteria_ May 13 '11 at 00:30
  • It's not about the display. I'm adding some text to the content and then saving it back to S3. The Chinese characters look fine if I run the process on Windows and look at the updated data in S3. But if it gets processed on Linux, the characters just turn into ???. I'm viewing it in a browser using the S3 link. – Shamik May 13 '11 at 00:38
  • Maybe I should be a little more precise. After I retrieve the content, I add a few more Chinese characters to it and save it back to S3. The new characters I added look fine. The existing ones are the ones getting messed up. I'm sort of clueless about this weird behaviour. – Shamik May 13 '11 at 00:43
  • Show the code which is doing the "saving back". (Also, try to use `"UTF-8"` instead of `"utf-8"`.) – Paŭlo Ebermann May 13 '11 at 00:54

1 Answer

The default charset is different on the two OSes you're using.

To start off, you can confirm the difference by printing out the default charset.

Charset.defaultCharset().name()

Somewhere in your code, I think this default charset is being used for some String conversion. The correct procedure should be to track that down, and specify UTF-8.
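As a rough illustration of what to look for (this is not your actual code; the class and method names `CharsetDemo` and `roundTrip` are made up for the example), here is a minimal sketch showing how a String-to-bytes conversion behaves with an ASCII charset versus an explicit UTF-8 one. A call such as `getBytes()` with no argument uses the platform default, so on a machine whose default is US-ASCII it would behave like the second case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: encode with the given charset, then decode with the same one.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two Chinese characters

        // What the JVM falls back to when no charset is specified.
        System.out.println("default charset: " + Charset.defaultCharset().name());

        // Explicit UTF-8 preserves the characters.
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese));

        // US-ASCII cannot represent them, so each one is replaced with '?'.
        System.out.println(roundTrip(chinese, StandardCharsets.US_ASCII));
    }
}
```

The same idea applies to readers, writers and `new String(bytes)`: anywhere a charset argument is optional, passing UTF-8 explicitly avoids depending on the platform default.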

Without seeing that code, I can only suggest the 'cheating' way to do it: set the default charset explicitly, near the beginning of your code, or at Java startup. For how to change the default charset, see the question "Setting the default Java character encoding?"

HTH

laher
  • Thanks for your input. Charset.defaultCharset().name() shows US-ASCII. Now, if I update .bashrc and add LANG=en_US.UTF-8, it works fine. But I want to do this programmatically instead of setting it in the bash profile. I'm not sure why specifying UTF-8 as the encoding doesn't solve the issue. I even tried encoding the strings to UTF-8. Is there a way to override the default character set in Java? – Shamik May 13 '11 at 21:10
  • Hi Shamik, you said you found a way to solve this issue. I'm currently facing exactly the same problem. Could you please explain how you solved it? – heiningair Jan 23 '15 at 14:25