How to count String bytes properly?

Question

A Java string containing special characters such as ç takes two bytes for each special character, but neither the String length method nor the length of the byte array returned by the getBytes method counts special characters as two bytes.

How can I count correctly the number of bytes in a String?

Example:

The word endereço should return me length 9 instead of 8.

When I run `System.out.println("endereço".getBytes().length);` it prints "9". — briarheart, Apr 03 '17 at 22:10
@briarheart which version of Java? In Java 7 I'm getting eight. — Philippe Gioseffi, Apr 03 '17 at 22:12
@briarheart `getBytes()` uses the platform default encoding, which may already be `UTF-8`. See: [Platform's default charset on different platforms?](http://stackoverflow.com/questions/9312816/java-platforms-default-charset-on-different-platforms) — avojak, Apr 03 '17 at 22:13
I am using Java 8. I suppose "utf-8" is a default encoding for any version of Java unless this behavior is overridden explicitly. — briarheart, Apr 03 '17 at 22:23
Define _special chars_. What makes you think it takes _two bytes of size_? Where? Do you mean in the `char[]` backing the `String`? _The word endereço should return me length 9 instead of 8._ Why? Why not 32? — Sotirios Delimanolis, Apr 03 '17 at 22:27
@briarheart UTF-8 is *not* the default encoding for any version of Java. The default encoding is generally defined by the OS, and is usually UTF-8 on Linux, but rarely on Windows. — Andreas, Apr 03 '17 at 22:28
Length depends greatly on encoding, e.g. for `endereço` it's `ISO-8859-1`: 8, `UTF-8`: 9, `EUC-JP`: 10, `UTF-16BE`: 16, `UTF-32`: 32 — Andreas, Apr 03 '17 at 22:30
@Andreas Yes, you are right. I see "file.encoding" property with value "UTF-8" even if I did not specify it. Explicit fallback for "UTF-8" exists only in the code of `java.nio.charset.Charset` class. — briarheart, Apr 03 '17 at 22:40
I wasn't getting the correct length because my default encoding is `ISO-8859-1`. — Philippe Gioseffi, Apr 03 '17 at 22:40
Again, define _length_. The `String#length()` method has a very specific definition. — Sotirios Delimanolis, Apr 03 '17 at 22:43

davidxxx · Accepted Answer · 2018-02-07T19:20:08.117

The word endereço should return me length 9 instead of 8.

If you expect a size of 9 bytes for the String "endereço", which has a length of 8 characters (7 ASCII characters and 1 non-ASCII character), I suppose you want the UTF-8 charset, which uses 1 byte for each character in the ASCII table and 2 to 4 bytes for any other character.

but neither the String length method nor the length of the byte array returned by the getBytes method counts special characters as two bytes.

The String length() method does not answer the question "how many bytes are used?" It answers "how many UTF-16 code units (more simply, chars) does the string contain?"

String length() Javadoc :

Returns the length of this string. The length is equal to the number of Unicode code units in the string.
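To make the distinction concrete, here is a minimal sketch comparing code units, code points and UTF-8 bytes. The class and method names are my own, chosen for illustration; only `length()`, `codePointCount` and `getBytes(Charset)` come from the JDK. The literals use Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

public class LengthVersusBytes {

    // Number of UTF-16 code units (Java chars): what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points: a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes once the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o";   // "endereço"
        String emoji = "\uD83D\uDE00";   // U+1F600, stored as a surrogate pair

        // prints "8 8 9"
        System.out.println(codeUnits(word) + " " + codePoints(word) + " " + utf8Bytes(word));
        // prints "2 1 4"
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji) + " " + utf8Bytes(emoji));
    }
}
```

None of the three numbers is "the" length; which one you need depends on whether you are indexing chars, counting characters, or sizing a buffer.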

The byte[] getBytes() method with no argument encodes the String into a byte array. You can use the length property of the returned array to find out how many bytes the encoded String uses, but the result depends on the charset used for the encoding. The no-argument byte[] getBytes() method does not let you specify the charset: it uses the platform's default charset.
So it may not give the expected result if the underlying OS defaults to a charset other than the one you want to use to encode your Strings.
Besides, the way Strings are encoded into bytes may change depending on the platform where the application is deployed, which may be undesirable.
Finally, if the String cannot be encoded in the default charset, the behavior is unspecified.
So this method should be used with great caution, or not at all.

byte[] getBytes() Javadoc :

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the default charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
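As a rough sketch of what "more control" can look like, the helper below uses a CharsetEncoder configured to report errors, so an unencodable character makes the call fail instead of being silently altered. The class name, the method name and the choice to rethrow as IllegalArgumentException are my own assumptions, not something the Javadoc prescribes.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictByteCount {

    // Returns the encoded size in bytes, or throws if the text
    // cannot be represented exactly in the given charset.
    static int strictByteCount(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // encode() returns a buffer whose remaining() is the byte count
            return encoder.encode(CharBuffer.wrap(text)).remaining();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Cannot encode text as " + charset, e);
        }
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        System.out.println(strictByteCount(word, StandardCharsets.UTF_8));      // prints 9
        System.out.println(strictByteCount(word, StandardCharsets.ISO_8859_1)); // prints 8
        try {
            strictByteCount(word, StandardCharsets.US_ASCII); // ç has no ASCII encoding
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```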

In your example String "endereço", if getBytes() returns an array of size 8 and not 9, it means that your OS does not default to UTF-8 but to a charset that uses a fixed width of 1 byte per character, such as ISO 8859-1 or charsets derived from it, such as windows-1252 on Windows.

To find out the default charset of the Java virtual machine the application runs on, you can use this utility method: Charset defaultCharset = Charset.defaultCharset().

Solution

The byte[] getBytes() method has two other very useful overloads:

byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException
byte[] java.lang.String.getBytes(Charset charset)

Unlike the getBytes() method with no argument, these methods let you specify the charset to use for the byte encoding.

byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException Javadoc :

Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the given charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.

byte[] java.lang.String.getBytes(Charset charset) Javadoc :

Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
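A small sketch of that replacement behavior, using US-ASCII as a charset that cannot represent ç. The class and method names are hypothetical. The point is that the byte count looks plausible even though a character was lost.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {

    // Encodes with US-ASCII; unmappable characters are replaced, not reported.
    static byte[] asciiBytes(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = asciiBytes("endere\u00e7o"); // "endereço"

        // prints 8: the ç was replaced by a single '?' byte
        System.out.println(bytes.length);
        // prints "endere?o"
        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
    }
}
```

So a count obtained this way is only meaningful when the charset can actually represent every character of the String.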

You may use either one (though there are some subtle differences between them) to encode your String into a byte array with UTF-8 or any other charset, and so get its size for that specific charset.

For example, to get a UTF-8 encoded byte array with getBytes(String charsetName), you can do this:

String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;

And you will get a length of 9 bytes as you wish.

Here is a more comprehensive example that displays the default charset and then the encoded size with the platform default charset, UTF-8 and UTF-16:

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringByteCount {

    public static void main(String[] args) throws UnsupportedEncodingException {

        // default charset of the running JVM
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);

        // String sample
        String yourString = "endereço";

        // getBytes() with default platform encoding
        System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());

        // getBytes() with specific charset UTF-8
        System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
        System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());

        // getBytes() with specific charset UTF-16 (the count includes a 2-byte byte order mark)
        System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
        System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
    }
}

Output on my machine that is Windows OS based:

default charset = windows-1252
getBytes() with default charset, size = 8

getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9

getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18
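To compare more charsets in one go, such as the ones listed in the comments on the question (ISO-8859-1, UTF-8, EUC-JP, UTF-16BE, UTF-32), something like the following sketch could be used. The class and method names are mine. Charsets the JDK does not guarantee (EUC-JP and UTF-32 here) are skipped when the runtime does not provide them, so I am not asserting their values.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class ByteLengthPerCharset {

    // Size in bytes of the text once encoded with the named charset.
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    // Byte length per charset, in the order given, skipping unsupported charsets.
    static Map<String, Integer> byteLengths(String text, String... charsetNames) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) { // optional charsets may be absent
                result.put(name, byteLength(text, name));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        System.out.println(byteLengths(word,
                "ISO-8859-1", "UTF-8", "EUC-JP", "UTF-16BE", "UTF-32"));
    }
}
```

Note that "UTF-16BE" gives 16 bytes while "UTF-16" gives 18 in the output above: the latter writes a 2-byte byte order mark in front of the encoded text.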

"String length() method doesn't answer to the question : how many bytes are used ? But answer to : "how many characters are contained?"" no it returns the number of UTF-16 code units in the string. There can be multiple code units per code point and there can be multiple code points per "grapheme cluster" (what most users would consider a character). — plugwash, Feb 07 '18 at 15:59
@plugwash Technically speaking, yes, you are right. I oversimplified too much, I think. I should have been more specific: "how many `char` are contained?" I updated. Thanks for this relevant remark :) — davidxxx, Feb 07 '18 at 19:19
