1
public class ClassToTestSnippets {

    private static ClassToTestSnippets ctts;

    public static void main(String[] args) {
        ctts = new ClassToTestSnippets();
        ctts.testThisMethod();
    }

    public void testThisMethod() {
        System.out.println("\u2014".length()); //answer is 1
    }
}

The code above prints 1. But \u2014 is E2 80 94 in UTF-8, i.e. 3 bytes. How do I find out how many bytes a string contains?

Suresh Subedi
  • 1
    This is like looking at a screenshot (not a file, just the displayed image) and asking how big the file is. The answer in both cases is that it depends how it's encoded... – Jon Skeet Oct 21 '14 at 14:25
  • 1
    Also see: http://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16 – Puce Oct 21 '14 at 14:35

2 Answers

9

Depends. What encoding do you want to use?

System.out.println("äö".getBytes("UTF-8").length);

Prints 4, but if I change UTF-8 to ISO-8859-1 (for example), it'll print 2. Other encodings may print other values (try UTF-32).
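As a minimal sketch of the same idea, the encoded size can be compared across a few charsets. The class and the `byteLength` helper are hypothetical names used only for illustration. The `StandardCharsets` constants are used instead of charset name strings, which avoids the checked `UnsupportedEncodingException`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLength {

    // Hypothetical helper: number of bytes the string occupies
    // once it is encoded with the given charset.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String dash = "\u2014";          // the character from the question
        String umlauts = "\u00e4\u00f6"; // "äö", written as escapes to be independent of the source file encoding

        System.out.println(byteLength(dash, StandardCharsets.UTF_8));         // 3 (E2 80 94)
        System.out.println(byteLength(dash, StandardCharsets.UTF_16BE));      // 2 (20 14)
        System.out.println(byteLength(umlauts, StandardCharsets.UTF_8));      // 4
        System.out.println(byteLength(umlauts, StandardCharsets.ISO_8859_1)); // 2
    }
}
```

UTF_16BE is chosen here rather than UTF_16 because encoding with the latter also writes a 2-byte byte order mark, which would inflate the count.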

Kayaman
  • This is not the number of bytes in the actual String object, though. It's the number of bytes in the UTF-8 representation of the string. – RealSkeptic Oct 21 '14 at 14:30
4

Internally, it contains (number of chars) * 2 bytes of character data, as each char in Java takes up two bytes (a Java char is a 16-bit UTF-16 code unit). For "\u2014" the actual bytes are 0x20 and 0x14.

However, the length() method returns the number of chars, not the number of bytes.
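The relationship between chars and bytes can be checked with a short sketch. The class and the `hex` helper are hypothetical names for illustration, and encoding to UTF-16BE is used here only as a way to look at the two bytes of each char, not as a claim about how a given JVM lays out the String object.

```java
import java.nio.charset.StandardCharsets;

public class CharsVersusBytes {

    // Hypothetical helper: renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u2014";

        System.out.println(s.length());     // 1 char (UTF-16 code unit)
        System.out.println(s.length() * 2); // 2 bytes of character data

        // The two bytes of that single char, high byte first.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 20 14
    }
}
```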

RealSkeptic
  • 1
    How do I store UTF-32? What happens then? – Suresh Subedi Oct 21 '14 at 14:34
  • Internally it contains more than that. You're counting only the size of the internal `char[]` here, the size of the whole `String` is bigger. – Kayaman Oct 21 '14 at 14:44
  • You use two surrogate characters for any character that's above the 0xFFFF code point. See this [tutorial](http://docs.oracle.com/javase/tutorial/i18n/text/unicode.html) from Oracle. – RealSkeptic Oct 21 '14 at 14:44
  • @Kayaman Indeed. But the number of additional bytes is constant and not dependent on the actual character data inside it. – RealSkeptic Oct 21 '14 at 14:50
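The surrogate-pair point from the comments can be illustrated with a small sketch. The class and the `codePoints` helper are hypothetical names, and U+1F600 is just an arbitrary example of a code point above 0xFFFF.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {

    // Hypothetical helper: number of Unicode code points, which can be
    // smaller than length() when the string contains surrogate pairs.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 written as its two UTF-16 surrogates.
        String s = "\uD83D\uDE00";

        System.out.println(s.length());                                  // 2 chars
        System.out.println(codePoints(s));                               // 1 code point
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 4 bytes
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 4 bytes
    }
}
```

So for characters outside the Basic Multilingual Plane, length() counts two chars for what a reader sees as a single character.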