7

I have a program working on enormous data sets. The objects are best stored on hash implemented containers since the program keeps seeking for objects in the container.

The first idea was to use HashMap since the methods get and remove of this container are more suitable to the uses I need.

But, I came to see the use of HashMap is pretty memory consumable which is a major problem, so i thought switching to HashSet will be better because it only uses <E>, and not <K,V> per element, but when I looked at the implementation i learned it uses an underlying HashMap! this means it wont save any memory!

So this is my questions:

  • Are all my assumptions true?
  • Is HashMap memory wasteful? more specifically, what is its overhead for each entry?
  • Is HashSet just as wasteful as HashMap?
  • Is there any other Hash based containers which will be significantly less memory consumables?

    update

As requested in the comments I will extend a bit on my program, the hashMap is meant to hold a pair of other objects, and some numeric value - a float- calculated from them. along the way it extracts some of them and enters new pairs. Given a pair it needs to ensure it doesnt hold this pair or to remove it. The mapping can be done using the float value or the hashCode of the pair object.

Additionally when i say "enormous data sets" I am talking about ~ 4*10^9 objects

Ravid Goldenberg
  • 2,119
  • 4
  • 39
  • 59
  • what are your assumptions? – SMA Feb 01 '15 at 09:53
  • *which is a major problem*: is it? Have you measured and proven that using a HashSet in your usecase did consume too much memory? What is the use-case? – JB Nizet Feb 01 '15 at 09:59
  • @almasshaikh My assumptions is everything written in my post and specifically the questions which follows... – Ravid Goldenberg Feb 01 '15 at 10:04
  • @JBNizet I haven't tested HashSet since the code using HashMap is already written and working(on smaller data sets) so before I modify it to HashSet doesn't it make since to think if this will even help? – Ravid Goldenberg Feb 01 '15 at 10:06
  • Then have you measured and proven that using a HashMap in your usecase did consume too much memory? What is the use-case? – JB Nizet Feb 01 '15 at 10:07
  • @JBNizet I am not sure what do you mean by "use-case", but, I have tested it, when using the program on bigger data sets the part where I build my HashMap takes more memory than I can afford. – Ravid Goldenberg Feb 01 '15 at 10:10
  • 1
    You start by saying that you're using a HashMap instead of using a HashSet, although they quite different kinds of collections. All we know about your use-case (i.e. what your app must do) is that it must process "enormous data sets". We can't recommend any solution with so few information. Besides, I've seen so many questions asking about "enormous" data sets where "enormous" actually meant "around 1000", which is in fact tiny, that I prefer asking. – JB Nizet Feb 01 '15 at 10:14
  • 1
    @petric, I edited my [answer](http://stackoverflow.com/a/28261619/3208640) although it's not answering all your questions it contains some tips maybe be useful. – void Feb 01 '15 at 11:41
  • @JBNizet Ok I get it now and have dited my post. – Ravid Goldenberg Feb 01 '15 at 12:31
  • @FarhangAmary I saw it thnks. – Ravid Goldenberg Feb 01 '15 at 12:45
  • 3 * 10^10 objects. 30 billion objects. Just having those objects in memory, assuming they have no state at all, would need 16 * 30 billion bytes, so 480 GB of memory. Whatever the collection you use, you won't have enough memory. What you need is some terabytes of storage and a good database. – JB Nizet Feb 01 '15 at 12:59
  • @JBNizet Ya I miscalculated, its 4 *10^9 ...the machine has 256GB – Ravid Goldenberg Feb 01 '15 at 13:45

3 Answers3

15

There are very useful tips on this site about collections performance in java.

HashSet is built on top of a HashMap< T, Object >, where value is a singleton ‘present’ object. It means that the memory consumption of aHashSet is identical to HashMap: in order to store SIZE values, you need 32 * SIZE + 4 * CAPACITY bytes (plus size of your values). It is definitely not a memory-friendly collection.

THashSet could be the easiest replacement collection for a HashSet – it implements Set and Iterable, which means you should just update a single letter in the initialization of your set.

THashSet uses a single object array for its values, so it uses 4 * CAPACITY bytes for storage. As you can see, compared to JDK HashSet, you will save 32 * SIZE bytes in case of the identical load factor, which is a huge improvement.

Also the below image which I took from here can help us keeping something in mind for choosing right collection

enter image description here

void
  • 7,760
  • 3
  • 25
  • 43
5

Are all my assumptions true?

You are correct that HashSet is implemented using HashMap, so you will not save any memory by using HashSet instead.

If you're creating maps with a large number of elements, you should construct your HashMaps with an initialCapacity to the best of your knowledge, in order to prevent repeated rehashing (thus memory thrashing).

Is HashMap memory wasteful? more specifically, what is its overhead for each entry?

No, it's not wasteful. The overhead is an underlying array (size modified by loadFactor), and an Entry object for each key-value pair. In addition to storing a key and value, the entry object also stores a pointer to the next entry in a slot (in case two or more entries are occupying the same slot in the underlying array). The default loadFactor of 0.75 keeps the underlying array size at 133% of the number of entries.

Very specifically, the memory overhead for each entry is:

  • the entry object's reference to the key,
  • the entry object's reference to the value,
  • the entry object's reference to the next entry,
  • and the underlying array's reference to the entry (divided by load factor).

It's very difficult to get much more trim than that for a hash-based collection.

Is HashSet just as wasteful as HashMap?

You will gain no memory efficiency by using HashSet instead of HashMap.

Is there any other Hash based containers which will be significantly less memory consumables?

If your keys are primitives (e.g. ints), there are custom Map and Set implementations out there (in third party libraries) which use more memory-efficient data structures.

Community
  • 1
  • 1
gknicker
  • 5,509
  • 2
  • 25
  • 41
  • Thank you for your answer, when using the word "wastfull" I didn't mean "improperly doing their job" I meant the choice to use them will consume a lot of memory per item because of the use of many references, which is 2 per item beside the size of the actual object and key, am I right? – Ravid Goldenberg Feb 01 '15 at 10:17
  • 1
    You're welcome. I know what you meant. I've updated my answer to be more specific about the overhead. – gknicker Feb 01 '15 at 10:42
1

It is true that HashSet uses just as much memory as HashMap. The difference between the two that HasSet implements Set, i.e., it does not care about any value associated with a key, only the presence or lack thereof of a particular value. HashMap is concerned with storing/retrieving (put/get) of values per key.

While HashMap/HashSet store data in an array that is usually slightly larger that the number of elements, this shold not be too much of a problem because the load factor is .75. This means that a HashMap will grow when the number of elements reaches 75% of the size of the underlying array.

A bigger concern than a large map would be lots of empty maps, since the default size of a HashMap is 16. This can be offset by setting the initial capacity to 0.

You can also use TreeMap instead, however, since TreeMap is based on references instead of an array, you will probably waste even more space, especially with larger maps, besides also losing some speed. The main benefit of TreeMap is that it maintains the keys in an ordered state, so if you need them sorted that is the way to go.

Additionally, TreeMap can be used for programming reasons when you either cannot or do not want to make a custom implementation of the equals and hashCode methods of your key type. You can make a comparator for the key type instead. E.g., to make a map/set based on a case-insensitive String, use String.CASE_INSENSITIVE_ORDER as the comparator of a TreeSet

Joseph K. Strauss
  • 4,683
  • 1
  • 23
  • 40