
I'm trying to understand character encoding for Strings in Java. I'm working on Windows 10, where the default character encoding is windows-1251. It is an 8-bit encoding, so each symbol should take exactly 1 byte. When I call getBytes() on a String of 6 symbols, I therefore expect an array of 6 bytes. But the following snippet returns 12 instead of 6.

"Привет".getBytes("windows-1251").length // returns 12

At first, I thought that the first byte of each character must be zero. But both bytes related to each character have non-zero values. Could anyone explain what I'm missing here, please?
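For reference, encoding each character individually does give one byte per symbol. Here is a small sketch (the class name is just illustrative, and the literal is spelled with `\u` escapes so it is unambiguous):

```java
import java.nio.charset.Charset;

public class PerChar { // illustrative name
    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // "Привет" spelled as explicit code points
        for (char c : "\u041F\u0440\u0438\u0432\u0435\u0442".toCharArray()) {
            byte[] b = String.valueOf(c).getBytes(cp1251);
            // each character maps to exactly one windows-1251 byte
            System.out.printf("%c -> %d byte(s): 0x%02x%n", c, b.length, b[0]);
        }
    }
}
```

The first line printed is `П -> 1 byte(s): 0xcf`, which matches the single-byte mapping I expect.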

Here is an example of how I tested it:

import java.nio.charset.Charset;
import java.io.*;
import java.util.HexFormat;

public class Foo
{
    public static void main(String[] args) throws Exception
    {
        System.out.println(Charset.defaultCharset().displayName());
        String s = "Привет";
        System.out.println("bytes count in windows-1251: " + s.getBytes("windows-1251").length);
        printBytes(s.getBytes("windows-1251"), "windows-1251");
    }
    
    public static void printBytes(byte[] array, String name) {
        for (int k = 0; k < array.length; k++) {
            System.out.println(name + "[" + k + "] = " + "0x" +
                byteToHex(array[k]));
        }
    }

    public static String byteToHex(byte b) {
        // Returns the two-digit hex representation of byte b
        char[] hexDigit = {
            '0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
        };
        char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
        return new String(array);
    }
}

The result is:

windows-1251
bytes count in windows-1251: 12
windows-1251[0] = 0xd0
windows-1251[1] = 0x9f
windows-1251[2] = 0xd1
windows-1251[3] = 0x80
windows-1251[4] = 0xd0
windows-1251[5] = 0xb8
windows-1251[6] = 0xd0
windows-1251[7] = 0xb2
windows-1251[8] = 0xd0
windows-1251[9] = 0xb5
windows-1251[10] = 0xd1
windows-1251[11] = 0x82

But what I expect is:

windows-1251
bytes count in windows-1251: 6
windows-1251[0] = 0xcf
windows-1251[1] = 0xf0
windows-1251[2] = 0xe8
windows-1251[3] = 0xe2
windows-1251[4] = 0xe5
windows-1251[5] = 0xf2
Ilyas
  • Very strange: on my machine this works without problem and yields the expected output. Also running it in an online IDE like https://www.jdoodle.com/online-java-compiler/ also gives the expected output. – Thomas Kläger May 08 '23 at 17:12
  • Note that `HexFormat.of().withDelimiter(", ").formatHex("Привет".getBytes("UTF-8"))` is `d0, 9f, d1, 80, d0, b8, d0, b2, d0, b5, d1, 82`; perhaps you have compiled with a source file whose encoding differs from windows-1251 – DuncG May 08 '23 at 17:17
  • For some reason it looks like your VM cannot decode windows-1251. It's essentially ignored and you get UTF-8. BTW, if you import `HexFormat`, why do you do so much work? ;) – g00se May 08 '23 at 17:18
  • 1
    [Can not reproduce](https://tio.run/##fVDLasMwELznKxaBQSaOSAK9NO2lpcf2kmPIQbFlR7EjGWudxJRA6Y/0HwqFfobzRa78CDGlqS6LdndmZ2bDd3y0CeKq8hNuDDxzqeB1APal@SqRPhjkaMtOywC2dkrnmEkVLZbAs8i4gOtM7w08HXyRotRndP3mhUGxZTpHlloMJopu7D2mpGb@mmdGIHvsaiBCnifYfanLAmnShBcvfCuo684upM15MHAPpPw4vZXf5Wf5dXons//uklWBwoCvc4VgLe6lCqzs0WR6M7kFAkMwLBL4UG9R0p8SlyVCRbjuaWhY293rMA9@NVr8cXAt3R5rrbZJOOOFd7asbBRuL95QZ0AtBGKbxXhmy12L6ATbznDYB/yVTUiJYxZOsKw5Ds54enAU8ZpbHsReS7iIlz33x87Hsap@AA). You should ensure that you source file has the right encoding (and the compiler options matched). Try to verify with `System.out.println(s);` – Holger May 08 '23 at 17:22

1 Answer


It looks like you may have had a UTF-8 encoded source file when you compiled:

HexFormat.of().formatHex("Привет".getBytes("UTF-8"))
==> "d09fd180d0b8d0b2d0b5d182"
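Those are exactly the UTF-8 bytes of your string. What likely happened: the compiler read your UTF-8 file as windows-1251, turning the 6 characters into 12 mojibake characters, and `getBytes("windows-1251")` then faithfully re-encoded those 12 characters back into the original 12 file bytes. A minimal sketch of that round trip (class name is illustrative; the literal uses `\u` escapes so this snippet itself is encoding-independent):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip { // illustrative name
    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // "Привет" spelled as explicit code points
        byte[] fileBytes = "\u041F\u0440\u0438\u0432\u0435\u0442"
                .getBytes(StandardCharsets.UTF_8);       // the 12 bytes on disk
        String mojibake = new String(fileBytes, cp1251); // what the compiler "saw"
        System.out.println(mojibake);                    // РџСЂРёРІРµС‚ (12 characters)
        // re-encoding the mojibake restores the original file bytes
        System.out.println(Arrays.equals(fileBytes, mojibake.getBytes(cp1251))); // true
    }
}
```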

If I save your code in my UTF-8 editor and compile+run it:

java -Dfile.encoding=UTF-8 Foo.java
UTF-8
bytes count in windows-1251: 6
windows-1251[0] = 0xcf
windows-1251[1] = 0xf0
windows-1251[2] = 0xe8
windows-1251[3] = 0xe2
windows-1251[4] = 0xe5
windows-1251[5] = 0xf2

Whereas this matches your output if I compile+run that UTF-8 file with your default encoding:

java -Dfile.encoding=windows-1251 Foo.java
windows-1251
bytes count in windows-1251: 12
windows-1251[0] = 0xd0
windows-1251[1] = 0x9f
windows-1251[2] = 0xd1
windows-1251[3] = 0x80
windows-1251[4] = 0xd0
windows-1251[5] = 0xb8
windows-1251[6] = 0xd0
windows-1251[7] = 0xb2
windows-1251[8] = 0xd0
windows-1251[9] = 0xb5
windows-1251[10] = 0xd1
windows-1251[11] = 0x82

If I change my editor's charset to windows-1251, then the output is as expected:

java -Dfile.encoding=windows-1251 Foo.java
windows-1251
bytes count in windows-1251: 6
windows-1251[0] = 0xcf
windows-1251[1] = 0xf0
windows-1251[2] = 0xe8
windows-1251[3] = 0xe2
windows-1251[4] = 0xe5
windows-1251[5] = 0xf2
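As a side note (a sketch, not part of the original code): one way to sidestep the mismatch entirely is to spell non-Latin literals with Unicode escapes, which `javac` decodes before anything else, so the file's byte encoding can no longer change the string:

```java
import java.nio.charset.Charset;

public class Escaped { // illustrative name
    public static void main(String[] args) {
        // \u escapes are translated before compilation proper,
        // so this literal is "Привет" regardless of the file's charset
        String s = "\u041F\u0440\u0438\u0432\u0435\u0442";
        System.out.println(s.getBytes(Charset.forName("windows-1251")).length); // 6
    }
}
```

This prints 6 no matter what encoding the source file is saved in.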

EDIT

For simplicity above I've used the `java Foo.java` "compile and launch" mode, but for normal separate compilation the important steps are to match `javac` to the character encoding of the source code, and `java` to whatever encoding you want the app to run with:

javac -encoding TheJavaFilesCharSet {JavaFiles}
java  -Dfile.encoding=AnyOrDefaultCharSet {ClassWithMain}

As mentioned in the comments, it's worth using HexFormat. It is immutable, and therefore safe to assign to a static field directly or to wrap in a user-friendly debugging method:

private static final HexFormat HEX = HexFormat.ofDelimiter(", ").withPrefix("0x").withUpperCase();
public static String formatHex(byte[] arr) {
    return "new byte[/*"+arr.length+"*/] {"+HEX.formatHex(arr)+"}";
}

HEX.formatHex(new byte[]{1,2,3});
==> "0x01, 0x02, 0x03"

formatHex(new byte[]{1,2,3});
==> "new byte[/*3*/] {0x01, 0x02, 0x03}"

The latter is helpful if you want to cut and paste definitions back into test cases.

DuncG
  • Yes, you're absolutely right. I also changed my java file encoding and it worked as expected. Thanks a lot. This shows me how complicated encoding is: when we initialize a variable in code, the value depends on the encoding of the source (compiled) file as well. I thought that since the variable is initialized and handled by the JVM, it would take care of everything. Of course, hardcoding a non-Latin value in code is bad practice. – Ilyas May 08 '23 at 17:46