I'm trying to understand character encoding for Strings in Java. I'm working on Windows 10, where the default character encoding is windows-1251. That is an 8-bit encoding, so each symbol should take exactly one byte, and when I call getBytes() on a String of 6 symbols I expect an array of 6 bytes. But the following snippet returns 12 instead of 6:
"Привет".getBytes("windows-1251").length // returns 12
At first I thought that each character was simply prefixed with a zero byte, but both bytes in every pair have non-zero values. Could anyone explain what I'm missing here, please?
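One way to see what the String itself holds is to dump its UTF-16 code units (a small sketch; charAt returns a single code unit, printed here as hex):

    String s = "Привет";
    for (int i = 0; i < s.length(); i++) {
        // each char is one UTF-16 code unit of the String
        System.out.printf("char[%d] = U+%04X%n", i, (int) s.charAt(i));
    }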
Here is the full program I used to test it:
import java.nio.charset.Charset;

public class Foo {
    public static void main(String[] args) throws Exception {
        System.out.println(Charset.defaultCharset().displayName());

        String s = "Привет";
        System.out.println("bytes count in windows-1251: " + s.getBytes("windows-1251").length);
        printBytes(s.getBytes("windows-1251"), "windows-1251");
    }

    public static void printBytes(byte[] array, String name) {
        for (int k = 0; k < array.length; k++) {
            System.out.println(name + "[" + k + "] = 0x" + byteToHex(array[k]));
        }
    }

    // Returns the two-digit hex representation of byte b
    public static String byteToHex(byte b) {
        char[] hexDigit = {
            '0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
        };
        char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
        return new String(array);
    }
}
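On Java 17 and later, java.util.HexFormat can produce the same dump without the hand-rolled byteToHex. A self-contained sketch (the class name is arbitrary):

    import java.nio.charset.Charset;
    import java.util.HexFormat;

    public class FooHex {
        public static void main(String[] args) {
            byte[] bytes = "Привет".getBytes(Charset.forName("windows-1251"));
            // formatHex renders the array as lowercase hex, e.g. "d09fd180..."
            System.out.println(HexFormat.of().formatHex(bytes));
        }
    }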
The result is:
windows-1251
bytes count in windows-1251: 12
windows-1251[0] = 0xd0
windows-1251[1] = 0x9f
windows-1251[2] = 0xd1
windows-1251[3] = 0x80
windows-1251[4] = 0xd0
windows-1251[5] = 0xb8
windows-1251[6] = 0xd0
windows-1251[7] = 0xb2
windows-1251[8] = 0xd0
windows-1251[9] = 0xb5
windows-1251[10] = 0xd1
windows-1251[11] = 0x82
But what I expect is:
windows-1251
bytes count in windows-1251: 6
windows-1251[0] = 0xcf
windows-1251[1] = 0xf0
windows-1251[2] = 0xe8
windows-1251[3] = 0xe2
windows-1251[4] = 0xe5
windows-1251[5] = 0xf2
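For completeness, those expected bytes are the windows-1251 code points of П, р, и, в, е, т. A sketch that pins the literal down with Unicode escapes, so it cannot depend on how the source file itself is encoded:

    // the escapes spell the same word "Привет"
    String s = "\u041F\u0440\u0438\u0432\u0435\u0442";
    System.out.println(s.getBytes(Charset.forName("windows-1251")).length);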