16

I wanted to use a HashSet<Long> for storing a large list of unique numbers in memory. I calculated the approximate memory it would consume (assuming 64-bit pointer size):

Long would take 16 bytes of space. So initially I multiplied the number of entries by 16 to get the memory. But in reality, the memory was much more than 16 bytes per entry. After that I studied the HashSet implementation. In short, the underlying implementation actually stores an extra dummy object (12 bytes) with each entry of the HashSet, and a pointer (8 bytes) to the next entry. Thus it costs an extra 12+8 bytes per entry.

So total memory per entry: 16+12+8 = 36 bytes. But when I ran the code and checked the memory, it was still much more than 36 bytes per entry.
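For reference, a minimal sketch of this kind of measurement (illustrative only; the class name and element count are made up, and the numbers will vary by JVM):

    import java.util.HashSet;
    import java.util.Set;

    public class PerEntryCost {
        public static void main(String[] args) {
            final int n = 1_000_000;
            Runtime rt = Runtime.getRuntime();
            System.gc(); // best effort to stabilize the baseline
            long before = rt.totalMemory() - rt.freeMemory();

            Set<Long> set = new HashSet<>();
            for (long i = 0; i < n; i++) {
                set.add(i);
            }

            long after = rt.totalMemory() - rt.freeMemory();
            System.out.printf("~%d bytes per entry%n", (after - before) / n);
            System.out.println(set.size()); // keep the set reachable
        }
    }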

My question (in short): How much memory does a HashSet take (for instance, on a 64-bit machine)?

Aqeel Ashiq
  • You forgot to take into account the capacity. –  Apr 08 '15 at 15:06
  • it seems you're forgetting JVM memory usage too – nafas Apr 08 '15 at 15:07
  • http://java-performance.info/memory-consumption-of-java-data-types-2/ – nafas Apr 08 '15 at 15:16
  • "Long would take 16 bytes of space." Why? http://stackoverflow.com/questions/9037468/what-is-the-storage-cost-for-a-boxed-primitive-in-java – weston Apr 08 '15 at 15:18
  • @nafas I read that article, but it still does not explain why it multiplies 32 by the size, and then 4 by the capacity. – Aqeel Ashiq Apr 08 '15 at 15:18
  • This isn't C though, you can't really count bytes like this. There are overheads and alignment implementation details that mean even a boxed `int` on 64-bit could be 24 bytes! – weston Apr 08 '15 at 15:27
  • Plus the overhead of the actual hash table itself. It's an array of buckets. An array of size N contains N object references (including null elements), which have a certain size. And the table's size is not immediately clear knowing just the number of `Long` objects you added. – Brandon Apr 08 '15 at 15:29
  • I think you can use one of the methods described to figure out the size of the set. You could also write a program that keeps growing the set, say once per second, and use a profiler to see the memory growth. Finally, if space is really a concern for you, I suggest you use [`BitSet`](https://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html) or something like that. Be careful, I remember `BitSet` having a quirky interface with unexpected pitfalls. – Giovanni Botta Apr 08 '15 at 15:30
  • Each bucket is a list of some kind (not sure if it's an array, or an `ArrayList`, or `LinkedList`, or what). An ideal hashing algorithm spreads things out so each bucket/list only has 1 item, but there is potential for a collision, and therefore the buckets must be a data structure which can hold multiple values. – Brandon Apr 08 '15 at 15:30
  • Another article: https://seniorjava.wordpress.com/2013/09/01/java-objects-memory-size-reference/ – FuRioN Apr 08 '15 at 15:31
  • `BitSet` is good for compactly storing integer values which are very close together over a short range. Go ahead and create a `new BitSet()` and turn on value `0` and value `2,000,000,000`. Then check your memory usage. Despite only having 2 values, the object's memory footprint will be huge! However, if you only enable bits 1 - 100, the footprint will be extremely small. – Brandon Apr 08 '15 at 15:34
  • `HashSet` generally takes ~32 bytes per element plus the memory consumption of the element itself. – Louis Wasserman Apr 08 '15 at 16:27

5 Answers

8

You can measure this size exactly using this test:

    long m1 = Runtime.getRuntime().freeMemory();
    // create the object(s) to measure here, e.g. new HashSet<Long>()
    long m2 = Runtime.getRuntime().freeMemory();
    System.out.println(m1 - m2); // bytes consumed by the object(s)

to be run with the -XX:-UseTLAB option (it disables thread-local allocation buffers, so small allocations show up in freeMemory())

On my 64-bit HotSpot, an empty HashSet takes 480 bytes.

Why so much? Because HashSet has a complex structure (by the way, an IDE in debug mode helps you see the actual fields). It is based on HashMap (the Adapter pattern), so a HashSet itself contains a reference to a HashMap. The HashMap contains 8 fields. The actual data live in an array of Nodes. A Node has: int hash; K key; V value; Node next. HashSet uses only the keys and puts a dummy object in the values.
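A simplified sketch of that structure, modeled on the OpenJDK sources (abbreviated, not the real implementation):

    import java.util.HashMap;

    // HashSet delegates to a HashMap and stores one shared dummy object
    // as the value for every key.
    class SimplifiedHashSet<E> {
        private static final Object PRESENT = new Object();
        private final HashMap<E, Object> map = new HashMap<>();

        public boolean add(E e) {
            // one Node per entry: int hash + key ref + value ref + next ref
            return map.put(e, PRESENT) == null;
        }
    }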

Evgeniy Dorofeev
5

The size of objects is an implementation detail. There is no guarantee that if it's x bytes on one platform, it's also x bytes on another.

Long is boxed, as you know, but 16 bytes is wrong. The primitive long takes 8 bytes, but the size of the box around it is implementation-dependent. According to this HotSpot-related answer, overhead words and padding mean a boxed 4-byte int can come to 24 bytes!

The byte alignment and padding mentioned in that (HotSpot-specific) answer would also apply to the Entry objects, which pushes the consumption up further.
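If you want to see the actual header, padding, and field layout on your own JVM, one option is OpenJDK's JOL tool (assumption: the jol-core library is on the classpath; it is not part of the JDK):

    import org.openjdk.jol.info.ClassLayout;

    public class BoxLayout {
        public static void main(String[] args) {
            // Prints field offsets, header size, and alignment padding.
            System.out.println(ClassLayout.parseClass(Integer.class).toPrintable());
            System.out.println(ClassLayout.parseClass(Long.class).toPrintable());
        }
    }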

weston
2

The memory used is 32 * SIZE + 4 * CAPACITY + (16 * SIZE), where SIZE is the number of elements and CAPACITY is the length of the backing table.
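As code, the formula might look like this (a sketch only: it assumes HashMap's default initial capacity of 16 and load factor of 0.75 to derive CAPACITY from SIZE; the method name is made up):

    public class HashSetEstimate {
        // 32 bytes per entry + 4 bytes per table slot + 16 bytes per boxed Long.
        static long estimateBytes(int size) {
            int capacity = 16; // default initial capacity
            while (capacity * 0.75 < size) {
                capacity *= 2; // table doubles past the load-factor threshold
            }
            return 32L * size + 4L * capacity + 16L * size;
        }

        public static void main(String[] args) {
            // ~56 million bytes, i.e. roughly 56 bytes per element
            System.out.println(estimateBytes(1_000_000));
        }
    }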

FuRioN
  • do you mean 64 * SIZE? – nafas Apr 08 '15 at 15:09
  • No, it is 12 bytes header + 16 bytes data + 4 bytes padding – FuRioN Apr 08 '15 at 15:11
  • Can you please explain the formula: why are we multiplying 32 by the size, and 4 by the capacity, and then 16 by the size again? – Aqeel Ashiq Apr 08 '15 at 15:15
  • Please provide a source for that. – Sibbo Apr 08 '15 at 15:15
  • The first 32 bytes, as I said in the previous comment, are the size of each entry; the 4 bytes are per slot of the capacity; and the last 16 bytes are for the Long values. Better explained: http://java-performance.info/memory-consumption-of-java-data-types-2/ – FuRioN Apr 08 '15 at 15:18
  • So why is the size of the header 32 bytes? And why is the capacity multiplied with 4? And why is the size of a `Long` object 16 bytes? – Sibbo Apr 08 '15 at 15:20
  • The header of a HashSet entry contains a key (12 bytes), a value (16 bytes), and a pointer to the next entry (4 bytes), and the capacity is multiplied by 4 for the pointer to each entry. – FuRioN Apr 08 '15 at 15:27
  • @FuRioN can you explain the last term `( 16 * SIZE ) `? Is that supposed to be the amount of memory taken up by the actual `Long` entry data? – acidnbass Dec 03 '20 at 18:22
1

HashMap's default capacity is 16 HashMapEntry slots. Every HashMapEntry has four fields (int keyHash, Object next, Object key, Object value), so it introduces overhead just by wrapping the elements, even for empty slots. Additionally, HashMap doubles its capacity whenever it grows past its load-factor threshold (0.75 by default), so for 17 elements you'll have a 32-slot table with 15 of them empty, as modeled below.
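A minimal model of that resize rule (assuming OpenJDK's defaults of initial capacity 16 and load factor 0.75; the real HashMap also rehashes the entries, which this ignores):

    public class ResizeModel {
        public static void main(String[] args) {
            int capacity = 16;              // default initial capacity
            final double loadFactor = 0.75; // default load factor
            for (int size = 1; size <= 17; size++) {
                if (size > capacity * loadFactor) {
                    capacity *= 2;          // the table doubles
                    System.out.printf("resized to %d at element %d%n", capacity, size);
                }
            }
        }
    }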

An easier way is to check a heap dump with a memory analyzer.

Fran Montero
1

A HashSet is a complicated beast. Off the top of my head, and after reviewing some of the comments, here are some items consuming memory that you have not accounted for:

  1. Java collections (true collections, not plain arrays) can only take object references, not primitives. Therefore, your long primitive gets boxed into a java.lang.Long object and a reference to it is added to the HashSet. Somebody mentioned that a `Long` object will be 24 bytes. Plus the reference, which is 8 bytes.
  2. The hash table buckets are collections. I don't recall if they are arrays or ArrayList, or LinkedList, etc., but because hashing algorithms can produce collisions, the elements of the HashSet must be put into collections, which are organized by hash code. Best case is an ArrayList with just 1 element: your Long object. The default backing array size for ArrayList is 10, so that would be 10 object references within the object, so at least 80 bytes now per Long. Since Long hashes from its integer bits, I suspect the hashing algorithm does a good job spreading things out. But a long whose value exceeds Integer.MAX_VALUE must share a hash with some other value, by the pigeonhole principle, since 64 bits are folded into a 32-bit hash (see the collision sketch after this list).
  3. The actual hash table - HashSet is basically a HashMap where the value is not interesting. Under the hood, it creates a HashMap, which has an array of buckets in it to represent the hash table. The array size is based on the capacity, which is not clear based on the number of elements you added.
  4. The size of the hash table will usually, intentionally, have more buckets than needed, in order to make future growth easier. Hopefully it's not a lot more. But don't expect that 5 elements takes exactly 5 buckets.
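To make the guaranteed-collision point from item 2 concrete, here is a small sketch (the two values are just one of many pairs whose halves XOR to the same hash):

    public class LongHashCollision {
        public static void main(String[] args) {
            // Long.hashCode() folds 64 bits into 32: (int) (value ^ (value >>> 32)),
            // so distinct longs can land in the same bucket.
            long a = 0L;
            long b = (1L << 32) | 1L; // 4294967297
            System.out.println(Long.valueOf(a).hashCode()); // 0
            System.out.println(Long.valueOf(b).hashCode()); // 0
        }
    }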

Long story short, hash tables are a memory-intensive data structure. It's the space/time trade-off: assuming a good hash distribution, you get constant-time lookups at the cost of extra memory usage.

Brandon
  • *"The hash table buckets are collections."* Hash buckets are a linked list, and now possibly a tree structure. They are a package-private subtype of `Map.Entry`, unrelated to any collections visible to us. – Radiodef Apr 08 '15 at 21:09