According to the Java documentation for String.length:
public int length()
Returns the length of this string.
The length is equal to the number of Unicode code units in the string.
Specified by:
length in interface CharSequence
Returns:
the length of the sequence of characters represented by this object.
But then I don't understand why the following program, HelloUnicode.java, produces different results on different platforms. According to my understanding, the number of Unicode code units should be the same, since Java supposedly always represents strings in UTF-16:
public class HelloWorld {
public static void main(String[] args) {
String myString = "I have a in my string";
System.out.println("String: " + myString);
System.out.println("Bytes: " + bytesToHex(myString.getBytes()));
System.out.println("String Length: " + myString.length());
System.out.println("Byte Length: " + myString.getBytes().length);
System.out.println("Substring 9 - 13: " + myString.substring(9, 13));
System.out.println("Substring Bytes: " + bytesToHex(myString.substring(9, 13).getBytes()));
}
// Code from https://stackoverflow.com/a/9855338/4019986
private final static char[] hexArray = "0123456789ABCDEF".toCharArray();
public static String bytesToHex(byte[] bytes) {
char[] hexChars = new char[bytes.length * 2];
for ( int j = 0; j < bytes.length; j++ ) {
int v = bytes[j] & 0xFF;
hexChars[j * 2] = hexArray[v >>> 4];
hexChars[j * 2 + 1] = hexArray[v & 0x0F];
}
return new String(hexChars);
}
}
The output of this program on my Windows box is:
String: I have a in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 26
Byte Length: 26
Substring 9 - 13:
Substring Bytes: F09F9982
The output on my CentOS 7 machine is:
String: I have a in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: i
Substring Bytes: F09F99822069
I ran both with Java 1.8. Same byte length, different String length. Why?
UPDATE
By replacing the "" in the string with "\uD83D\uDE42", I get the following results:
Windows:
String: I have a ? in my string
Bytes: 4920686176652061203F20696E206D7920737472696E67
String Length: 24
Byte Length: 23
Substring 9 - 13: ? i
Substring Bytes: 3F2069
CentOS:
String: I have a in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: i
Substring Bytes: F09F99822069
Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...
Java Versions:
Windows:
java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)
CentOS:
openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-b09)
OpenJDK 64-Bit Server VM (build 25.201-b09, mixed mode)
Update 2
Using .getBytes("utf-8")
, with the "" embedded in the string literal, here are the outputs.
Windows:
String: I have a in my string
Bytes: 492068617665206120C3B0C5B8E284A2E2809A20696E206D7920737472696E67
String Length: 26
Byte Length: 32
Substring 9 - 13:
Substring Bytes: C3B0C5B8E284A2E2809A
CentOS:
String: I have a in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: i
Substring Bytes: F09F99822069
So yes it appears to be a difference in system encoding. But then that means string literals are encoded differently on different platforms? That sounds like it could be problematic in certain situations.
Also... where is the byte sequence C3B0C5B8E284A2E2809A
coming from to represent the smiley in Windows? That doesn't make sense to me.
For completeness, using .getBytes("utf-16")
, with the "" embedded in the string literal, here are the outputs.
Windows:
String: I have a in my string
Bytes: FEFF00490020006800610076006500200061002000F001782122201A00200069006E0020006D007900200073007400720069006E0067
String Length: 26
Byte Length: 54
Substring 9 - 13:
Substring Bytes: FEFF00F001782122201A
CentOS:
String: I have a in my string
Bytes: FEFF004900200068006100760065002000610020D83DDE4200200069006E0020006D007900200073007400720069006E0067
String Length: 24
Byte Length: 50
Substring 9 - 13: i
Substring Bytes: FEFFD83DDE4200200069