When accessing Windows system resources (related to audio), I found that Windows provides the description strings of these resources in its own charset, while Java treats them the way it treats all strings by default: as Unicode. So instead of sensible text I got a bunch of question marks:
????????? ???????? ???????
Using the String.codePointAt() method, I discovered that these question marks actually hide some text in the Windows-1252 encoding, which of course I would like to see. And so my crusade to convert this string into something readable began.
Half a day later, after rummaging through Stack Overflow and Google for related topics, I had made some progress, but that only led to more questions. So, here's my code:
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import javax.sound.sampled.AudioSystem;

public class Study_Encoding {
    //private static final Charset utf8Charset = Charset.forName("UTF-8");
    private static final Charset win1251Charset = Charset.forName("Windows-1251");
    private static final Charset win1252Charset = Charset.forName("Windows-1252");

    public static void main(String[] args) {
        // Name of the first mixer, as reported by the system
        String str = AudioSystem.getMixerInfo()[0].getName();
        System.out.println("Original string:");
        System.out.println(str + "\n");

        System.out.println("Its code-points:");
        displayCodePointSequence(str);

        // getBytes() encodes the chars into bytes of the given charset;
        // chars the charset cannot represent come out as '?' (63)
        System.out.println("Windows-1251-decoded byte array (wrong):");
        byte[] win1251ByteArr = str.getBytes(win1251Charset);
        displayByteSequence(win1251ByteArr);

        System.out.println("Windows-1252-decoded byte array (right):");
        byte[] win1252ByteArr = str.getBytes(win1252Charset);
        displayByteSequence(win1252ByteArr);

        // Decoding the Windows-1252 bytes as Windows-1252 gives back the garbled string
        System.out.println("Windows-1252-encoded string (wrong):");
        try {
            System.out.println(win1252Charset.newDecoder()
                    .decode(ByteBuffer.wrap(win1252ByteArr)).toString() + "\n");
        } catch (Exception e) {
            System.out.println("ERROR:" + e.toString());
        }

        // Decoding the same bytes as Windows-1251 gives the readable text
        System.out.println("Windows-1251-encoded string (right):");
        try {
            System.out.println(win1251Charset.newDecoder()
                    .decode(ByteBuffer.wrap(win1252ByteArr)).toString() + "\n");
        } catch (Exception e) {
            System.out.println("ERROR:" + e.toString());
        }
    }

    // Prints the numeric value at each char index, then the string length
    private static void displayCodePointSequence(String str) {
        if (null == str) {
            System.out.println("No string");
            return;
        }
        if (str.isEmpty()) {
            System.out.println("Empty string");
            return;
        }
        for (int k = 0; str.length() > k; ++k) {
            System.out.print(str.codePointAt(k) + " ");
        }
        System.out.println("[" + str.length() + "]\n");
    }

    // Prints each byte as an unsigned value (0..255), then the array length
    private static void displayByteSequence(byte[] byteArr) {
        if (null == byteArr) {
            System.out.println("No array");
            return;
        }
        if (0 == byteArr.length) {
            System.out.println("Empty array");
            return;
        }
        for (int k = 0; byteArr.length > k; ++k) {
            System.out.print((((int) byteArr[k]) & 0xFF) + " ");
        }
        System.out.println("[" + byteArr.length + "]\n");
    }
}
This program produces the following output (the last line is what I wanted to get all along):
Original string:
????????? ???????? ???????
Its code-points:
207 229 240 226 232 247 237 251 233 32 231 226 243 234 238 226 238 233 32 228 240 224 233 226 229 240 [26]
Windows-1251-decoded byte array (wrong):
63 63 63 63 63 63 63 63 63 32 63 63 63 63 63 63 63 63 32 63 63 63 63 63 63 63 [26]
Windows-1252-decoded byte array (right):
207 229 240 226 232 247 237 251 233 32 231 226 243 234 238 226 238 233 32 228 240 224 233 226 229 240 [26]
Windows-1252-encoded string (wrong):
????????? ???????? ???????
Windows-1251-encoded string (right):
Первичный звуковой драйвер
As anyone can see, the win1251 and win1252 encodings got mixed up for some reason. Also, I guess there is a way to make a Java program treat all strings as being in some native encoding (which I DO NOT WANT!!!), or at least to treat the system-provided ones that way. So,...
...my questions are:
- How to convert a string? (Which I've solved, I guess)
- What's going on? (With mixed charsets and all else)
- How to do it right? (String acquisition, if not, string conversion)
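For reference, a minimal sketch of the conversion from the first question, boiled down to one helper. It assumes the garbled string holds one Windows-1251 byte per char, which is what the code-point dump above suggests. The names `MojibakeRepair` and `repair` are made up for illustration, and `ISO_8859_1` is swapped in for Windows-1252 because it maps chars 0..255 to the same byte values:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Hypothetical helper, not part of any library.
    // Assumption: every char of the input carries one raw Windows-1251 byte (0..255).
    public static String repair(String garbled) {
        // ISO-8859-1 turns chars 0..255 back into the identical byte values
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // Reinterpret those bytes as Windows-1251 text
        return new String(raw, Charset.forName("windows-1251"));
    }

    public static void main(String[] args) {
        // First word of the dump above: 207 229 240 226 232 247 237 251 233
        String garbled = "\u00CF\u00E5\u00F0\u00E2\u00E8\u00F7\u00ED\u00FB\u00E9";

        // Encoding straight to Windows-1251 fails: U+00CF has no mapping there,
        // so getBytes() substitutes '?' (byte value 63)
        byte[] lossy = garbled.getBytes(Charset.forName("windows-1251"));
        System.out.println("First byte when encoded as windows-1251: " + lossy[0]);

        // Whether this displays correctly depends on the console encoding
        System.out.println("Repaired: " + repair(garbled));
    }
}
```

This only illustrates the round trip; it does not say anything about how to get the string in the right encoding in the first place, which is the third question.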
EDIT:
It seems I did not make it clear: I'm not talking about the contents of text files, but about system-provided strings such as the names and descriptions of devices (physical and virtual), and maybe file and directory names. In the example above, the string "Первичный звуковой драйвер" would be something like "Default Audio Device" on an English Windows.