46

I have some slides from IBM titled "From Java Code to Java Heap: Understanding the Memory Usage of Your Application", which say that when we use String instead of char[]:

Maximum overhead would be 24:1 for a single character!

but I don't understand which overhead is being referred to here. Can anybody please help?

Source: [image]

Roman C
codingenious

4 Answers

38

This figure relates to 32-bit JDK 6.

JDK 6

In the pre-Java-7 world, strings were implemented as a pointer to a region of a char[] array:

// "8 (4)" reads "8 bytes for x64, 4 bytes for x32"

class String{      //8 (4) house keeping + 8 (4) class pointer
    char[] buf;    //12 (8) bytes + 2 bytes per char -> 24 (16) aligned
    int offset;    //4 bytes                     -> three int
    int length;    //4 bytes                     -> fields align to
    int hash;      //4 bytes                     -> 16 (12) bytes
}

So I counted:

36 bytes per new String("a") for JDK 6 x32  <-- the overhead from the article
56 bytes per new String("a") for JDK 6 x64.


JDK 7

For comparison, in JDK 7+ (from Update 6 onward) String is a class that holds only a char[] buffer and a hash field.

class String{      //8 (4) + 8 (4) bytes             -> 16 (8)  aligned
    char[] buf;    //12 (8) bytes + 2 bytes per char -> 24 (16) aligned
    int hash;      //4 bytes                         -> 8  (4)  aligned
}

So it's:

28 bytes per String for JDK 7 x32 
48 bytes per String for JDK 7 x64.
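The tallies above can be reproduced with a small sketch. It only adds up the aligned per-component figures quoted in the two listings (object header, a one-character char[], and the int fields); it does not measure a real JVM, and the class and method names are made up for illustration.

```java
class StringFootprint {
    // Adds up the aligned component sizes quoted in the listings above.
    // These are the answer's figures, not values measured on a live JVM.
    static int total(int header, int charArray, int intFields) {
        return header + charArray + intFields;
    }

    public static void main(String[] args) {
        // JDK 6: header + one-char char[] + three int fields (offset, length, hash)
        System.out.println("JDK 6 x32: " + total(8, 16, 12));
        System.out.println("JDK 6 x64: " + total(16, 24, 16));

        // JDK 7: header + one-char char[] + a single int field (hash)
        System.out.println("JDK 7 x32: " + total(8, 16, 4));
        System.out.println("JDK 7 x64: " + total(16, 24, 8));
    }
}
```

The only difference between the JDK 6 and JDK 7 rows is the last argument, which is where the removal of the two extra int fields shows up.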

UPDATE

For the 3.75:1 ratio, see @Andrey's explanation below. The ratio approaches 1 as the string grows longer.

Andrey Chaschev
  • I see what's happening now. Perhaps you should show this in the answer a little bit. I got confused, so I'm sure others may too. You're showing the size of a String, but not of a char[1]. Both are sort of necessary to show the ratio – Cruncher Nov 20 '13 at 20:11
  • `480/128 = 3.75` is the ratio for `MyString` and for one-char string it's `368/16 = 23`. And the numbers are just somewhat better with those two fields gone. – Marko Topolnik Nov 20 '13 at 20:27
  • My JDK7 still uses `offset` and `size`. In fact, getting rid of that would be a very bad idea IMO. – Darkhogg Nov 20 '13 at 20:28
  • @Darkhogg It's been dead and gone since Java 7 Update 6. – Marko Topolnik Nov 20 '13 at 20:28
  • @MarkoTopolnik Is there an official answer to *why?*? – Darkhogg Nov 20 '13 at 20:29
  • @Darkhogg There was something on the mailing lists; the point is it caused more damage than good. – Marko Topolnik Nov 20 '13 at 20:30
  • @Darkhogg to avoid memory leaks when referencing string segments via `String.substring`. I.e. you have a huge string, so when you do a substring, the small string still references the large buffer and it's not GC-ed even if the reference to the original string is gone. – Andrey Chaschev Nov 20 '13 at 20:30
  • @AndreyChaschev Doesn't that introduce new memory **AND** computing overheads when doing substring? It may turn `O(n)` old code using substring into `O(n²)`... :/ – Darkhogg Nov 20 '13 at 20:33
  • The ratio is about gross bits allocated for a `String` instance vs. net bits in the characters themselves (not the `char[]` as a whole). – Marko Topolnik Nov 20 '13 at 20:33
  • @Darkhogg Yes, tough luck, it hurts *some* use cases. On the other hand, it is more transparent and predictable and more space-efficient for small strings, which means for 99% of all strings used in Java programs. The net effect is probably less heap usage. – Marko Topolnik Nov 20 '13 at 20:34
  • @Darkhogg New strings are just very simple, and it's nice. For other cases there are `StringBuilder`, `char[]` and a `MutableString` somewhere in Maven Central. – Andrey Chaschev Nov 20 '13 at 20:36
  • This rationale for the change is briefly described at http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-May/010257.html – meriton Nov 24 '13 at 16:43
  • I don't follow the math. Neglecting the alignment thing, 8+8+12+(2*6)+4=44 not 34. 4+4+8+(2*6)+4=32 not 22. Both are off by 10. What am I missing? – punstress Nov 28 '13 at 21:05
  • @punstress these are the numbers for JDK 6 and they are right above JDK7. I think I will add headers to highlight this. – Andrey Chaschev Nov 28 '13 at 22:05
9

In the JVM, a character variable is stored in a single 16-bit memory allocation, and changes to that Java variable overwrite that same memory location. This makes creating or updating character variables very fast and memory-cheap, but increases the JVM's overhead compared to the static allocation used in Strings.

The JVM stores Java Strings in a variable-size memory space (essentially, an array), which is exactly the same size (plus 1, for the string termination character) as the string when the String object is created or first assigned a value. Thus, an object with initial value "HELP!" would be allocated 96 bits of storage (6 characters, each 16 bits in size). This value is considered immutable, allowing the JVM to inline references to that variable, making static string assignments very fast, very compact, and very efficient from the JVM's point of view.

Reference

Shoaib Chikate
  • I don't really think the JVM needs the terminating char though – ratchet freak Nov 20 '13 at 13:27
  • @ratchetfreak Note that if you have the null terminator you can easily, under the hood of the JVM, use some C library's functions to operate on the strings. At least, this was *one* reason why Python implements strings *with* a string length field *and* null terminator. Might be the same reason for Java. In general sometimes it's convenient to have some redundancy. – Bakuriu Nov 20 '13 at 17:18
  • That's not much of a reference. `char[]` doesn't store the zero terminator. Python is another story, it's much more C-oriented. – Marko Topolnik Nov 20 '13 at 20:40
  • @MarkoTopolnik it may be that when you allocate a char[n] the jvm will allocate an array with an extra spot for the null terminator, but that is an implementation detail – ratchet freak Nov 21 '13 at 08:25
3

I'll try explaining the numbers referenced in the source article.

The article describes object metadata as typically consisting of class, flags and lock.

The class and lock are stored in the object header and take 8 bytes on a 32-bit VM. I haven't found any information, though, about JVM implementations that keep flags in the object header. It may be that this is stored somewhere externally (e.g. by the garbage collector, to count references to the object).

So let's assume that the article talks about some 32-bit "AbstractJVM" that uses 12 bytes of memory to store meta information about the object.

Then for char[] we have:

  • 12 bytes of meta information (8 bytes on x32 JDK 6, 16 bytes on x64 JDK 6)
  • 4 bytes for array size
  • 2 bytes for each character stored
  • 2 bytes of alignment if the number of characters is odd (on x64 JDK: 2 * (4 - (length + 2) % 4))

For java.lang.String we have:

  • 12 bytes of meta information (8 bytes on x32 JDK 6, 16 bytes on x64 JDK 6)
  • 16 bytes for String fields (this is for JDK 6; 8 bytes for JDK 7)
  • memory needed to store char[] as described above

So, let's count how much memory is needed to store "MyString" as a String object:

12 + 16 + (12 + 4 + 2 * "MyString".length + 2 * ("MyString".length % 2)) = 60 bytes.

On the other hand, we know that to store only the data (without information about the data type, length or anything else) we need:

2 * "MyString".length = 16 bytes

Overhead is 60 / 16 = 3.75

Similarly, for a single-character string we get the 'maximum overhead':

12 + 16 + (12 + 4 + 2 * "a".length + 2 * ("a".length % 2)) = 48 bytes
2 * "a".length = 2 bytes
48 / 2 = 24

Following the article authors' logic, the true maximum overhead is infinite, reached when we store an empty string :).
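The formula above can be written as a short sketch that evaluates it for several lengths. It uses only the assumptions made in this answer (12 bytes of metadata, 16 bytes of JDK 6 String fields, a 4-byte array length, 2 bytes of padding for an odd character count); the class and method names are hypothetical, and nothing here is measured on a real JVM.

```java
class OverheadRatio {
    static final int META = 12;          // per-object metadata assumed in the answer
    static final int STRING_FIELDS = 16; // JDK 6 String fields
    static final int ARRAY_LENGTH = 4;   // array size slot

    // Bytes for a char[] of n characters, padded when n is odd.
    static int charArrayBytes(int n) {
        return META + ARRAY_LENGTH + 2 * n + 2 * (n % 2);
    }

    // Bytes for a String wrapping a char[] of n characters.
    static int stringBytes(int n) {
        return META + STRING_FIELDS + charArrayBytes(n);
    }

    // Gross bytes allocated divided by net bytes of character data (n > 0).
    static double ratio(int n) {
        return (double) stringBytes(n) / (2 * n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 8, 100, 10_000}) {
            System.out.printf("%d chars: %d bytes, ratio %.2f%n",
                    n, stringBytes(n), ratio(n));
        }
    }
}
```

For n = 1 and n = 8 this gives the 24:1 and 3.75:1 ratios from the slides. The fixed 44 bytes of bookkeeping matter less and less as n grows, so the ratio approaches 1.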

Andrey
1

I read this in an old Stack Overflow answer that I can no longer find. In Oracle's JDK a String has four instance-level fields:

A character array
An integral offset
An integral character count
An integral hash value

That means that each String introduces an extra object reference (the String itself), and three integers in addition to the character array itself. (The offset and character count are there to allow sharing of the character array among String instances produced through the String#substring() methods, a design choice that some other Java library implementers have eschewed.) Beyond the extra storage cost, there's also one more level of access indirection, not to mention the bounds checking with which the String guards its character array.
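To illustrate what the offset and character count make possible, here is a minimal sketch of a string-like class whose substrings share the original backing array. `SharedSlice` and all of its members are hypothetical names invented for this example; it is not the JDK's actual String source.

```java
class SharedSlice {
    private final char[] value; // backing array, possibly shared with other slices
    private final int offset;   // index of this slice's first character in value
    private final int count;    // number of characters in this slice

    SharedSlice(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedSlice(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // No characters are copied: the new slice points into the same array.
    SharedSlice substring(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new IndexOutOfBoundsException();
        }
        return new SharedSlice(value, offset + begin, end - begin);
    }

    // The bounds check mentioned above guards the shared array.
    char charAt(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException();
        }
        return value[offset + index];
    }

    int length() {
        return count;
    }

    boolean sharesBufferWith(SharedSlice other) {
        return value == other.value;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

Taking `substring(6, 11)` of a slice built from "hello world" yields a five-character slice that still references the full eleven-character array, which is the price paid for making substring cheap.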

If you can get away with allocating and consuming just the basic character array, there's space to be saved there. It's certainly not idiomatic to do so in Java though; judicious comments would be warranted to justify the choice, preferably with mention of evidence from having profiled the difference.

chiru