The word endereço should return me length 9 instead of 8.
If you expect a size of 9 bytes for the String "endereço", which has a length of 8 characters (7 ASCII characters and 1 non-ASCII character), I suppose that you want to use the UTF-8 charset, which uses 1 byte for characters in the ASCII table and more for the others.
but String length method or getting the length of it with the byte
array returned from getBytes method doesn't return special chars
counted as two bytes.
The String length() method doesn't answer the question "how many bytes are used?". It answers "how many UTF-16 code units (or, more simply, chars) does the String contain?"
String length() Javadoc:
Returns the length of this string. The length is equal to the number
of Unicode code units in the string.
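To illustrate the Javadoc above, here is a minimal sketch comparing length() with the number of code points. The class and method names (LengthSketch, codeUnits, codePoints) are made up for this illustration, and the non-ASCII characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
public class LengthSketch {
    // Number of UTF-16 code units (chars), which is what String.length() returns.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; can be smaller than length().
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o";   // "endereço", with the cedilla as an escape
        String emoji = "\uD83D\uDE00";   // U+1F600, stored as a surrogate pair

        System.out.println(codeUnits(word) + " / " + codePoints(word));   // 8 / 8
        System.out.println(codeUnits(emoji) + " / " + codePoints(emoji)); // 2 / 1
    }
}
```

Neither number is a byte count: both are independent of any charset.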
The byte[] getBytes() method with no argument encodes the String into a byte array. You could use the length property of the returned array to know how many bytes the encoded String uses, but the result depends on the charset used for the encoding.
But the byte[] getBytes() method doesn't let you specify the charset: it uses the platform's default charset.
So it may not give the expected result if the default charset is not the one you want to use to encode your Strings into bytes.
Besides, the way the Strings are encoded into bytes may change depending on the platform where the application is deployed, which may be undesirable.
Finally, if the String cannot be encoded in the default charset, the behavior is unspecified.
So this method should be used with great caution, or not used at all.
byte[] getBytes() Javadoc:
Encodes this String into a sequence of bytes using the platform's
default charset, storing the result into a new byte array.
The behavior of this method when this string cannot be encoded in the
default charset is unspecified. The java.nio.charset.CharsetEncoder
class should be used when more control over the encoding process is
required.
In your String example "endereço", if getBytes() returns an array with a size of 8 and not 9, it means that the default charset is not UTF-8 but a charset that uses a fixed width of 1 byte per character, such as ISO 8859-1 or one of its derived charsets, such as windows-1252 on Windows.
To know the default charset of the Java virtual machine where the application runs, you can use this utility method: Charset defaultCharset = Charset.defaultCharset().
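As a rough sketch of the point above, the following prints the default charset and compares a fixed 1-byte charset with UTF-8. The class and method names (DefaultCharsetSketch, byteLength) are hypothetical, and 'ç' is written as a Unicode escape to avoid depending on the source file encoding. Note that on Java 18 and later the default charset is UTF-8 unless it is overridden, so the first line of output will vary with your JVM and settings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetSketch {
    // Size in bytes of the String once encoded with the given charset.
    public static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"

        // Depends on the JVM and platform; no fixed expected value here.
        System.out.println("default charset = " + Charset.defaultCharset());

        // ISO 8859-1 encodes every character it supports on exactly 1 byte.
        System.out.println(byteLength(word, StandardCharsets.ISO_8859_1)); // 8

        // UTF-8 needs 2 bytes for the cedilla.
        System.out.println(byteLength(word, StandardCharsets.UTF_8));      // 9
    }
}
```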
Solution
The byte[] getBytes() method comes with two other very useful overloads.
Unlike the getBytes() method with no argument, these overloads let you specify the charset to use for the byte encoding.
byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException
Javadoc :
Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.
The behavior of this method when this string cannot be encoded in the
given charset is unspecified. The java.nio.charset.CharsetEncoder
class should be used when more control over the encoding process is
required.
byte[] java.lang.String.getBytes(Charset charset)
Javadoc :
Encodes this String into a sequence of bytes using the given charset,
storing the result into a new byte array.
This method always replaces malformed-input and unmappable-character
sequences with this charset's default replacement byte array. The
java.nio.charset.CharsetEncoder class should be used when more control
over the encoding process is required.
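The replacement behavior described in this Javadoc, and the CharsetEncoder alternative it mentions, can be sketched as follows. This is only an illustration under stated assumptions: StrictEncodingSketch, lenientBytes and isEncodable are hypothetical names, and US-ASCII is chosen simply because it cannot represent 'ç'.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncodingSketch {
    // getBytes(Charset) silently substitutes unmappable characters
    // with the charset's replacement bytes ('?' for US-ASCII).
    public static byte[] lenientBytes(String s, Charset charset) {
        return s.getBytes(charset);
    }

    // A CharsetEncoder configured with REPORT fails instead of substituting,
    // so it can be used to detect Strings the charset cannot represent.
    public static boolean isEncodable(String s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        byte[] ascii = lenientBytes(word, StandardCharsets.US_ASCII);
        System.out.println(ascii.length);                                  // 8
        System.out.println((char) ascii[6]);                               // ?
        System.out.println(isEncodable(word, StandardCharsets.US_ASCII));  // false
        System.out.println(isEncodable(word, StandardCharsets.UTF_8));     // true
    }
}
```

So a byte count obtained from a charset that cannot represent the String is still a number, but it describes a lossy encoding.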
You may use either one (although there are some subtle differences between them) to encode your String into a byte array with UTF-8 or any other charset, and so get its size for that specific charset.
For example, to get a UTF-8 encoded byte array by using getBytes(String charsetName), you can do this:
String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;
And you will get a length of 9 bytes as you wish.
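The same result can be obtained with the getBytes(Charset) overload, which does not declare a checked exception. A minimal sketch, assuming a hypothetical helper named utf8Length (the cedilla is written as a Unicode escape so the source file encoding does not matter):

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthSketch {
    // Number of bytes the String occupies once encoded in UTF-8.
    // StandardCharsets.UTF_8 is always available, so there is no
    // UnsupportedEncodingException to declare or catch.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("endere\u00e7o")); // 9
    }
}
```

Using the constant also avoids typos in the charset name, which the String-based overload would only report at runtime.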
Here is a more comprehensive example that displays the default charset and the byte size with the platform's default charset, UTF-8 and UTF-16:
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringByteSizes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // default charset
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);

        // String sample
        String yourString = "endereço";

        // getBytes() with default platform encoding
        System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());

        // getBytes() with specific charset UTF-8
        System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
        System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());

        // getBytes() with specific charset UTF-16
        System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
        System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
    }
}
Output on my Windows-based machine:
default charset = windows-1252
getBytes() with default charset, size = 8
getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9
getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18
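The UTF-16 size of 18 rather than 16 may look surprising for 8 chars: when encoding, Java's UTF-16 charset writes a 2-byte byte-order mark before the data. The sketch below (class and method names are made up for the illustration) compares it with the UTF-16BE and UTF-16LE charsets, which write no byte-order mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16BomSketch {
    // Encodes the String with the given charset and returns the raw bytes.
    public static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço", 8 chars

        byte[] withBom = encode(word, StandardCharsets.UTF_16);
        // 2 bytes of byte-order mark (0xFE 0xFF) + 8 chars * 2 bytes
        System.out.println(withBom.length);                                   // 18
        System.out.printf("%02X %02X%n", withBom[0], withBom[1]);             // FE FF

        // Explicit byte order, no byte-order mark: 8 chars * 2 bytes
        System.out.println(encode(word, StandardCharsets.UTF_16BE).length);   // 16
        System.out.println(encode(word, StandardCharsets.UTF_16LE).length);   // 16
    }
}
```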