
How can I convert a String with characters decoded in codepage 1252 into a String decoded in codepage 1250?

For example

String str1252 = "ê¹ś¿źæñ³ó";
String str1250 = convert(str1252);
System.out.print(str1250);

I want a convert() function such that the printed output would be:

ęąśżźćńłó

These are Polish-specific characters.

Thank you for any suggestions.

Joachim Sauer
rafalry

1 Answer


It's pretty straightforward:

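import java.nio.charset.Charset;

public String convert(String s) {
    // Encode the text back to Windows-1252 bytes, then decode those same bytes as Windows-1250.
    return new String(s.getBytes(Charset.forName("Windows-1252")), Charset.forName("Windows-1250"));
}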

Note that System.out.print() can introduce another incorrect conversion due to a mismatch between the ANSI and OEM code pages on Windows. However, System.console().writer().print() should output it correctly.
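As a minimal sketch of the two points above, the following puts the conversion and the console output together. The class name CodepageDemo is made up for illustration, the sample string is written with Unicode escapes so the result does not depend on the source file encoding, and the fallback to System.out is an assumption for the case where System.console() returns null (for example when output is redirected).

```java
import java.io.Console;
import java.nio.charset.Charset;

// Hypothetical demo class; the name is not from the original answer.
final class CodepageDemo {
    private static final Charset CP1252 = Charset.forName("windows-1252");
    private static final Charset CP1250 = Charset.forName("windows-1250");

    // Encode the text back to Windows-1252 bytes, then decode those bytes as Windows-1250.
    static String convert(String s) {
        return new String(s.getBytes(CP1252), CP1250);
    }

    public static void main(String[] args) {
        // The question's sample without the two characters that Windows-1252 cannot encode:
        // e-circumflex, superscript one, inverted question mark, ae, n-tilde, superscript three, o-acute.
        String str1252 = "\u00EA\u00B9\u00BF\u00E6\u00F1\u00B3\u00F3";
        String str1250 = convert(str1252);

        // Prefer the console writer; System.console() is null when no console is attached.
        Console console = System.console();
        if (console != null) {
            console.writer().println(str1250);
        } else {
            System.out.println(str1250);
        }
    }
}
```

Characters that have no Windows-1252 encoding are replaced with '?' by getBytes(), so this sketch only round-trips text that Windows-1252 can represent.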

Community
axtavt
  • It breaks 'ś' and 'ź' for some reason, though. Maybe because they aren't in the Windows-1250? Not sure as I haven't used any of these encodings. – Sergei Tachenov Jan 31 '11 at 12:47
  • @Sergey: Yes, these characters are absent in [Windows-1252](http://en.wikipedia.org/wiki/Windows-1252). – axtavt Jan 31 '11 at 13:05
  • Thank you for the answer. It almost works. If a character is > 256, it won't be converted to bytes properly (it's changed to 63, '?'). Therefore additional detection is required for every char of the String. I have found the code below to work: `byte[] bytes = text.getBytes("Windows-1252"); String text1250 = new String(bytes, "Windows-1250"); StringBuffer buffer = new StringBuffer(text1250); for (int i = 0; i < bytes.length; i++) { char c = text.charAt(i); if (c > 256) buffer.replace(i, i + 1, text.substring(i, i + 1)); } text1250 = buffer.toString();` – rafalry Jan 31 '11 at 13:25
  • @rybz, I'd change that to `if (buffer.charAt(i) == '?' && text.charAt(i) != '?')` - that is, if it wasn't '?' but became '?', then take the original char instead. This has the advantage of working correctly even if some chars that aren't > 256 turn out to be broken too. – Sergei Tachenov Jan 31 '11 at 15:02
  • @Sergey, that might be a good point. I haven't found a case of a character <= 256 that is converted improperly. You would also have to make sure that a char that is <= 256 and comes out broken is valid in the target code page. I haven't checked that solution, but it might be advantageous. – rafalry Feb 17 '11 at 18:24
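
The per-character fallback discussed in these comments could be sketched as follows. This is not the code from the comments: instead of comparing against 256 or looking for '?', it asks a Windows-1252 CharsetEncoder whether each char can be encoded, and leaves the char untouched when it cannot. The class and method names (LenientCodepageConverter, convertLenient) are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper; a sketch of the per-character fallback idea, not the commenters' exact code.
final class LenientCodepageConverter {
    private static final Charset CP1252 = Charset.forName("windows-1252");
    private static final Charset CP1250 = Charset.forName("windows-1250");

    static String convertLenient(String text) {
        CharsetEncoder encoder = CP1252.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (encoder.canEncode(c)) {
                // Representable in Windows-1252: reinterpret its single byte as Windows-1250.
                out.append(new String(String.valueOf(c).getBytes(CP1252), CP1250));
            } else {
                // Not representable (e.g. already a correct Polish letter): keep it as-is.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With the question's sample string, the two characters that Windows-1252 cannot encode pass through unchanged and the rest are converted. The sketch does not check whether the resulting byte is actually assigned in Windows-1250, which is the concern raised in the last comment.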