When accessing Windows system resources (related to audio), I found that Windows provides the description strings of these resources in its own charset, while Java treats them the way it treats all strings by default: as Unicode. So instead of sensible text I got a bunch of question marks:

????????? ???????? ???????

Using the String .codePointAt () method, I discovered that these question marks actually hide some text in Windows-1252 encoding, which of course I would like to see. And so my crusade to convert this string into something readable began.

Half a day later, after rummaging through Stack Overflow and Google for related topics, I had made some progress, but that only led to more questions. So, here's my code:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import javax.sound.sampled.AudioSystem;


public class Study_Encoding {
    
    //private static final Charset utf8Charset = Charset .forName ("UTF-8");
    private static final Charset win1251Charset = Charset .forName ("Windows-1251");
    private static final Charset win1252Charset = Charset .forName ("Windows-1252");
    
    public static void main(String[] args) {
        
        String str = AudioSystem .getMixerInfo () [0] .getName ();
        
        System .out .println ("Original string:");
        System .out .println (str + "\n");
        
        System .out .println ("Its code-points:");
        displayCodePointSequence (str);
        
        System .out .println ("Windows-1251-decoded byte array (wrong):");
        byte [] win1251ByteArr = str .getBytes (win1251Charset);
        displayByteSequence (win1251ByteArr);
        
        System .out .println ("Windows-1252-decoded byte array (right):");
        byte [] win1252ByteArr = str .getBytes (win1252Charset);
        displayByteSequence (win1252ByteArr);
        
        System .out .println ("Windows-1252-encoded string (wrong):");
        try {
            System .out .println (win1252Charset .newDecoder ()
                    .decode (ByteBuffer .wrap (win1252ByteArr)) .toString () + "\n");
        } catch (Exception e) {
            System .out .println ("ERROR:" + e .toString ());
        }
        
        System .out .println ("Windows-1251-encoded string (right):");
        try {
            System .out .println (win1251Charset .newDecoder ()
                    .decode (ByteBuffer .wrap (win1252ByteArr)) .toString () + "\n");
        } catch (Exception e) {
            System .out .println ("ERROR:" + e .toString ());
        }
    }
    
    private static void displayCodePointSequence (String str) {
        
        if (null == str) {
            System .out .println ("No string");
            return;
        }
        if (str .isEmpty ()) {
            System .out .println ("Empty string");
            return;
        }
        for (int k = 0; str .length () > k; ++k) {
            System .out .print (str .codePointAt (k) + " ");
        }
        System .out .println ("[" + str .length () + "]\n");
    }
    
    private static void displayByteSequence (byte [] byteArr) {
        
        if (null == byteArr) {
            System .out .println ("No array");
            return;
        }
        if (0 == byteArr .length) {
            System .out .println ("Empty array");
            return;
        }
        for (int k = 0; byteArr .length > k; ++k) {
            System .out .print ((((int) byteArr [k]) & 0xFF) + " ");
        }
        System .out .println ("[" + byteArr .length + "]\n");
    }
}

This program produces the following output (where the last line is what I wanted to get all along):

Original string:
????????? ???????? ???????

Its code-points:
207 229 240 226 232 247 237 251 233 32 231 226 243 234 238 226 238 233 32 228 240 224 233 226 229 240 [26]

Windows-1251-decoded byte array (wrong):
63 63 63 63 63 63 63 63 63 32 63 63 63 63 63 63 63 63 32 63 63 63 63 63 63 63 [26]

Windows-1252-decoded byte array (right):
207 229 240 226 232 247 237 251 233 32 231 226 243 234 238 226 238 233 32 228 240 224 233 226 229 240 [26]

Windows-1252-encoded string (wrong):
????????? ???????? ???????

Windows-1251-encoded string (right):
Первичный звуковой драйвер

As anyone can see, the win1251 and win1252 encodings got mixed up for some reason. Also, I guess there is a way to make a Java program treat all strings as strings in some native encoding (which I DO NOT WANT!!!), or at least the system-provided ones. So...

...my questions are:

  1. How to convert a string? (Which I've solved, I guess)
  2. What's going on? (With mixed charsets and all else)
  3. How to do it right? (String acquisition or, failing that, string conversion)

EDIT:

It seems I have not made it clear: I'm not talking about the content of text files, but about system-provided strings such as names and descriptions of devices (physical and virtual), and maybe file and directory names. In the example above, the string "Первичный звуковой драйвер" should be something like "Default Audio Device" on English Windows.

1 Answer

This is a convoluted question, but the basics are:

  1. There's no such thing as a string without an encoding. The most common form (the C string) is traditionally ASCII-encoded. Java natively uses UTF-16.
  2. There's no perfect conversion between certain character sets. For instance, ASCII -> EBCDIC -> ASCII can result in a corrupt string, because there is no 1:1 relationship between these character sets.
  3. To me, it seems the file contains data in one character set and you want to convert it to the Java native form (UTF-16). This is very simple. You can use a FileInputStream to read the byte data, and a Reader to read String data. Hence you want your reader to perform the conversion: https://docs.oracle.com/javase/8/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream,%20java.nio.charset.Charset)

So basically, the code you are after is something like:

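import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Example value only: use the charset the file was actually written in.
Charset charsetOfChoice = Charset.forName("windows-1251");
// myFile is the File (or path) to read; the reader decodes its bytes into UTF-16 strings.
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(myFile), charsetOfChoice)))
{
   String line;
   while ((line = br.readLine()) != null)
   {
      // Do what you want with the string.
   }
}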
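
If the text arrives as an already-decoded String rather than as file bytes (as with the mixer name in the question), the same idea can be applied in memory. The following is only a sketch under an assumption: each char of the garbled string is really one raw Windows-1251 byte that was widened one-to-one, which is what the code points printed in the question suggest. The class name MojibakeRepair and the method reinterpret are made up for illustration. ISO-8859-1 is used to get the bytes back because it maps chars 0 to 255 straight onto bytes; for this particular string, Windows-1252 yields the same bytes, as the question's output shows.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {

    // Code points printed by the program in the question.
    static final int[] CODE_POINTS = {
        207, 229, 240, 226, 232, 247, 237, 251, 233, 32,
        231, 226, 243, 234, 238, 226, 238, 233, 32,
        228, 240, 224, 233, 226, 229, 240
    };

    // Hypothetical helper. Assumes every char in 'garbled' is a raw byte value (0-255)
    // of text that was really encoded in 'actual'.
    static String reinterpret(String garbled, Charset actual) {
        // ISO-8859-1 turns chars 0-255 back into the identical byte values.
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes with the charset they were written in.
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        String garbled = new String(CODE_POINTS, 0, CODE_POINTS.length);
        String repaired = reinterpret(garbled, Charset.forName("windows-1251"));
        // Whether this displays as Cyrillic depends on the console's own encoding.
        System.out.println(repaired);
    }
}
```

This repairs the string after the fact. It does not address why the string was decoded with the wrong charset in the first place.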

I will reiterate that the conversion may be imperfect depending on the source/target character set and may lead to corruption.
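
To make the corruption concrete, here is a small sketch; the class name LossyConversion and the method isLossless are made up for illustration. It encodes a Cyrillic word with a charset that has no Cyrillic letters. String.getBytes silently substitutes the charset's replacement byte, which is '?' (63) here, and that is the same effect as the row of 63s in the question's output. A CharsetEncoder can be asked up front whether a string would survive.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class LossyConversion {

    // Hypothetical helper: true if every character of 'text' has a mapping in 'target'.
    static boolean isLossless(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        Charset win1251 = Charset.forName("windows-1251");
        Charset win1252 = Charset.forName("windows-1252");
        String text = "\u0437\u0432\u0443\u043a"; // the Cyrillic word "звук"

        // Windows-1252 has no Cyrillic letters, so each char becomes the replacement byte.
        byte[] lossy = text.getBytes(win1252);
        System.out.println(Arrays.toString(lossy));      // [63, 63, 63, 63]

        System.out.println(isLossless(text, win1252));   // false
        System.out.println(isLossless(text, win1251));   // true
    }
}
```

Once the characters have been replaced by '?', the original text cannot be recovered from the bytes.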

John