8

I have a requirement to store 2 to 15 million Accounts (which are a String of length 15) in a data structure for lookup purpose and checking uniqueness. Initially I planned to store them in a HashSet, but I doubt the speed of the lookup will be slow because of hash collisions and will eventually be slower than a TreeMap (using Binary search).

There is no requirement for Data to be sorted. I am using Java 7. I have 64G system with 48G dedicated for this application.

This question is not a duplicate of HashSet and TreeSet performance test because that question is about the performance of adding elements to a Set and this question is about the performance of checking an existing Set for duplicate values.

Community
  • 1
  • 1
Mohan
  • 282
  • 2
  • 16
  • 1
    also refer [this](http://stackoverflow.com/questions/1463284/hashset-vs-treeset) – Ankur Singhal Aug 04 '15 at 04:31
  • Hi Ankuar, Thanks. The performance test in the link is based on 500K Integers in already sorted order. I have 10 million strings and wanted to understand the possiblity of hash collision. In the second link there is hint which was hwlpful. I will try and post my observations back. – Mohan Aug 04 '15 at 04:40
  • The lookup is for checking if a particular string is present in the Set of strings. This is a Standalone java program and cannot afford to use something like Redis to store the data. – Mohan Aug 04 '15 at 04:59

2 Answers2

14

If you have 48 GB of dedicated Memory for your 2 million to 15 million records, your best bet is probably to use a HashMap<Key, Record>, where your key is an Integer or a String depending on your requirements.

You will be fine as far as hash collisions go as long as you give enough memory to the Map and have an appropriate load factor.

I recommend using the following constructor: new HashMap<>(13_000_000); (30% more than your expected number of records - which will be automatically expanded by HashMap's implementation to 2^24 cells). Tell your application that this Map will be very large from the get-go so it doesn't need to automatically grow as you populate it.

HashMap uses an O(1) access time for it's members, whereas TreeMap uses O(log n) lookup time, but can be more efficient with memory and doesn't need a clever hashing function. However, if you're using String or Integer keys, you don't need to worry about designing a hashing function and the constant time lookups will be a huge improvement. Also, another advantage of TreeMap / TreeSet is the sorted ordering, which you stated you don't care about; use HashMap.

If the only purpose of the list is to check for unique account numbers, then everything I've said above is still true, but as you stated in your question, you should use a HashSet<String>, not a HashMap. The performance recommendations and constructor argument is still applicable.

Further reading: HashSet and TreeSet performance test

Community
  • 1
  • 1
durron597
  • 31,968
  • 17
  • 99
  • 158
  • Thank you very much. In case of a dynamically growing dataset, where i dont know the exact number of elements, may i know what would be better. The dataset can contain 2 million to say 15 million (Exact size unknown) – Mohan Aug 04 '15 at 04:53
  • @Mohan There is no difference if you have that much memory available. If your upper bound is that low compared to your amount of memory, just make the largest reasonable `HashMap` - 2^24 bits - and you will be fine. – durron597 Aug 04 '15 at 04:55
2

When we tried to store 50 million records in HashMap with proper initialization parameters, insertion started to slowdown, especially after 35 million records. Changing to TreeMap gave a constant insertion and retrieval performance.

Observation : TreeMap will give better performance than a HashMap for large input set. For a smaller set, of course HashMap will give better performance.

Mohan
  • 282
  • 2
  • 16
  • 1
    You didn't ask about 50 million records, you asked about 15 million records. At some point you need to think about your hash function and the likelihood of collisions if you key is just `String`, the default implementation is good for most purposes but it might not be good for 50 million strings. – durron597 May 19 '16 at 17:54