11

I'm thinking about filling a collection with a large amount of unique objects. How is the cost of an insert in a Set (say HashSet) compared to an List (say ArrayList)?

My feeling is that duplicate elimination in sets might cause a slight overhead.

Will
  • 2,858
  • 6
  • 33
  • 50
  • 1
    If you already have some mechanism that guarantees uniqueness why bother with the set? If you don't and you need to guarantee uniqueness, then a list is definitely not what you want. – Andrew May 18 '11 at 11:55

7 Answers7

11

There is no "duplicate elimination" such as comparing to all existing elements. If you insert into hash set, it's really a dictionary of items by hash code. There's no duplicate checking unless there already are items with the same hash code. Given a reasonable (well-distributed) hash function, it's not that bad.

As Will has noted, because of the dictionary structure HashSet is probably a bit slower than an ArrayList (unless you want to insert "between" existing elements). It also is a bit larger. I'm not sure that's a significant difference though.

Konrad Garus
  • 53,145
  • 43
  • 157
  • 230
  • The duplicate elimination is there, it's just inherent to the data structure. – Michael Borgwardt May 18 '11 at 11:59
  • Right. I mean, there typically is no duplicate elimination that would always compare the newly inserted element to all existing ones (unless you messed up the `hashCode`). – Konrad Garus May 18 '11 at 12:24
  • Thanks for clarification. Still inserting in a list is therefore a tiny little cheaper conceptually, right? – Will May 18 '11 at 15:31
3

You're right: set structures are inherently more complex in order to recognize and eliminate duplicates. Whether this overhead is significant for your case should be tested with a benchmark.

Another factor is memory usage. If your objects are very small, the memory overhead introduced by the set structure can be significant. In the most extreme case (TreeSet<Integer> vs. ArrayList<Integer>) the set structure can require more than 10 times as much memory.

Michael Borgwardt
  • 342,105
  • 78
  • 482
  • 720
3

If you're certain your data will be unique, use a List. You can use a Set to enforce this rule.

Sets are faster than Lists if you have a large data set, while the inverse is true for smaller data sets. I haven't personally tested this claim.

Which type of List?
Also, consider which List to use. LinkedLists are faster at adding, removing elements.

ArrayLists are faster at random access (for loops, etc), but this can be worked around using the Iterator of a LinkedList. ArrayLists are are much faster at: list.toArray().

Redandwhite
  • 2,529
  • 4
  • 25
  • 43
  • Not sure that linked lists are fast for insertion... Thought it was O(n) time to search the position, and then constant (and low) time for the insertion itself. LinkedList do not provide random access to the data. Moreover, an iterator does **not** provide random access. – Agemen May 18 '11 at 12:13
  • In reality I suspect it all depends on the implementation, and the OP can easily build their own. The List and Set interfaces obviously provide no concrete code so one can be made faster than the other. That said, I'm not sure how, but I've been hugely impressed with LinkedList, and swapped to it after I found ArrayList to be too slow. I was doing was `add()` and iteration – Redandwhite May 18 '11 at 12:21
  • 2
    That's because of the structure of the LinkedList, and because you regularly need to do an array copy when using add on an ArrayList. LinkedLists are really efficient for insertions at the beginning or end, but definitely not for random access. Insertions are not limited to add operations. – Agemen May 18 '11 at 12:59
  • Many thanks. I've removed the reference to insertion for LinkedLists in my above answer – Redandwhite May 18 '11 at 13:01
  • 1
    The links is broken for me – CharlieB May 18 '18 at 09:13
2

You have to compare concrete implementations (for example HashSet with ArrayList), because the abstract interfaces Set/List don't really tell you anything about performance.

Inserting into a HashSet is a pretty cheap operation, as long as the hashCode() of the object to be inserted is sane. It will still be slightly slower than ArrayList, because it's insertion is a simple insertion into an array (assuming you insert in the end and there's still free space; I don't factor in resizing the internal array, because the same cost applies to HashSet as well).

Joachim Sauer
  • 302,674
  • 57
  • 556
  • 614
2

If the goal is the uniqueness of the elements, you should use an implementation of the java.util.Set interface. The class java.util.HashSet and java.util.LinkedHashSet have O(alpha) (close to O(1) in the best case) complexity for insert, delete and contains check.

ArrayList have O(n) for object (not index) contains check (you have to scroll through the whole list) and insertion (if the insertion is not in tail of the list, you have to shift the whole underline array).

You can use LinkedHashSet that preserve the order of insertion and have the same potentiality of HashSet (takes up only a bit more of memory).

Alberto
  • 1,569
  • 1
  • 22
  • 41
  • Lists do not have insertion costs of O(n) – Will Jun 29 '11 at 09:08
  • ArrayList yes, because the array must be shifted. In the worst case (insertion in index 0), all elements of the array must be shifted by 1. – Alberto Jun 29 '11 at 10:04
1

Java List:

If you don't have such requirement that you have to keep duplicate or not. Then you can use List instead of Set.

List is an interface in Collection framework. Which extends Collection interface. and ArrayList, LinkedList is the implementation of List interface.

When to use ArrayList or LinkedList

ArrayList: If you have such requirement that in your application mostly work is accessing the data. Then you should go for ArrayList. because ArrayList implements RtandomAccess interface which is Marker Interface. because of Marker interface ArrayList have capability to access the data in O(1) time. and you can use ArrayList over LinkedList where you want to get data according to insertion order.

LinkedList: If you have such requirement that your mostly work is insertion or deletion. Then you should use LinkedList over the ArrayList. because in LinkedList insertion and deletion happen in O(1) time whereas in ArrayList it's O(n) time.

Java Set:

If you have requirement in your application that you don't want any duplicates. Then you should go for Set instead of List. Because Set doesn't store any duplicates. Because Set works on the principle of Hashing. If we add object in Set then first it checks object's hashCode in the bucket if it's find any hashCode present in it's bucked then it'll not add that object.

Vpn_talent
  • 1,290
  • 12
  • 21
1

I don't think you can make this judgement simply on the cost of building the collection. Other things that you need to take into account are:

  • Is the input dataset ordered? Is there a requirement that the output data structure preserves insertion order?
  • Is there a requirement that the output data structure is ordered (or reordered) based on element values?
  • Will the output data structure be subsequently modified? How?
  • Is there a requirement that the output data structure is duplicate free if other elements are added subsequently?
  • Do you know how many elements are likely to be in the input dataset?
  • Can you measure the size of the input dataset? (Or is it provided via an iterator?)
  • Does space utilization matter?

These can all effect your choice of data structure.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216