
I've got an ArrayList that can be anywhere from 0 to 5000 items long (pretty big objects, too).

At one point I compare it against another ArrayList, to find their intersection. I know this is O(n^2).

Is creating a HashMap alongside this ArrayList, to achieve constant-time lookup, a valid strategy here, in order to reduce the complexity to O(n)? Or is the overhead of another data structure simply not worth it? I believe it would take up no additional space (besides for the references).

(I know, I'm sure 'it depends on what I'm doing', but I'm seriously wondering if there's any drawback that makes it pointless, or if it's actually a common strategy to use. And yes, I'm aware of the quote about prematurely optimizing. I'm just curious from a theoretical standpoint).

bob
    _pretty big objects, too_ An `ArrayList` doesn't store objects, it stores references. – Sotirios Delimanolis Jul 29 '15 at 15:36
  • It won't really get bigger in memory, as @Sotirios Delimanolis says – Toilal Jul 29 '15 at 15:37
  • It is a common strategy, an equivalent of a database index. Side point: you actually need a `HashSet`. – Marko Topolnik Jul 29 '15 at 15:40
  • Unless you need a particular iteration order somewhere in your code, why not use a `HashSet` for these objects? – Chthonic Project Jul 29 '15 at 15:41
  • A particular iteration order can be achieved with `LinkedHashSet`. So you only need an `ArrayList` if you need random access. – user140547 Jul 29 '15 at 15:43
  • Good point @user140547. Another note that might help OP: maintain a `transient int` field for the hashcode for these objects so that the hashcode doesn't get recomputed for every equality check. – Chthonic Project Jul 29 '15 at 15:50
  • very useful, thanks. @SotiriosDelimanolis, do all Java data structures store references instead of raw Objects, except for an array, which is always contiguous buckets for its objects/primitives? – bob Jul 29 '15 at 15:55
  • As a developer, you never have access to an object directly; everything is done **through** references, even for array elements. That contiguous memory only holds references to objects. – Sotirios Delimanolis Jul 29 '15 at 15:56
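To illustrate the point from these comments, here is a minimal sketch (the `StringBuilder` just stands in for any "pretty big object"):

```java
import java.util.ArrayList;
import java.util.List;

public class ReferenceDemo {
    public static void main(String[] args) {
        StringBuilder big = new StringBuilder("big object");
        List<StringBuilder> list = new ArrayList<>();
        list.add(big);    // stores a reference, not a copy of the object
        big.append("!");  // modify through the original reference...
        System.out.println(list.get(0)); // big object! - the list sees the change
    }
}
```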

1 Answer


First of all, a short side note:

And yes, I'm aware of the quote about prematurely optimizing.

What you are asking about here is not "premature optimization"!

You are not talking about replacing a multiplication with some odd bitwise operations "because they are faster (on a 90's PC, in a C-program)". You are thinking about the right data structure for your application pattern. You are considering the application cases (though you did not tell us many details about them). And you are considering the implications that the choice of a certain data structure will have on the asymptotic running time of your algorithms. This is planning, or maybe engineering, but not "premature optimization".


That being said, and to tell you what you already know: It depends.

To elaborate on this a bit: It depends on the actual operations (methods) that you perform on these collections, how frequently you perform them, how time-critical they are, and how memory-sensitive the application is.

(For 5000 elements, the latter should not be a problem, as only references are stored - see the discussion in the comments)

In general, I'd also be hesitant to really store the Set alongside the List, if they are always supposed to contain the same elements. This wording is intentional: You should always be aware of the differences between both collections. Primarily: A Set can contain each element only once, whereas a List may contain the same element multiple times.

For all hints, recommendations and considerations, this should be kept in mind.
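A minimal illustration of that difference, with made-up values:

```java
import java.util.*;

public class SetVsListDemo {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>(Arrays.asList("a", "b", "a"));
        Set<String> set = new HashSet<>(list);
        System.out.println(list.size()); // 3 - the List keeps the duplicate "a"
        System.out.println(set.size());  // 2 - the Set collapses it
    }
}
```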

But even if it is taken for granted that, in your case, the lists will always contain each element only once, you still have to make sure that both collections are maintained properly. If you really just stored them side by side, you could easily introduce subtle bugs:

class Container<T>
{
    private Set<T> set = new HashSet<T>();
    private List<T> list = new ArrayList<T>();

    // Fine
    void add(T element)
    {
        set.add(element);
        list.add(element);
    }

    // Fine
    void remove(T element)
    {
        set.remove(element);
        list.remove(element); // May be expensive (O(n)), but ... well
    }

    // Added later, 100 lines below the other methods:
    void removeAll(Collection<T> elements)
    {
        set.removeAll(elements);
        // Oops - something's missing here: the list is never
        // updated, so set and list silently diverge
    }
}

To avoid this, one could even consider creating a dedicated collection class - something like a FastContainsList that combines a Set and a List, and forwards the contains call to the Set. But you'll quickly notice that it will be hard (or maybe impossible) not to violate the contracts of the Collection and List interfaces with such a collection, unless the clause that "You may not add elements twice" becomes part of the contract...
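Such a class could be sketched roughly like this (the name and all details are made up, and the duplicate handling in `add` is exactly the contract violation mentioned above):

```java
import java.util.*;

// Hypothetical sketch: a List backed by an ArrayList for indexed access
// and a parallel HashSet for O(1) contains().
public class FastContainsList<T> extends AbstractList<T> {
    private final List<T> list = new ArrayList<T>();
    private final Set<T> set = new HashSet<T>();

    @Override
    public boolean add(T element) {
        if (!set.add(element)) {
            // Rejecting duplicates violates the List contract,
            // which says that add() always returns true for a List
            return false;
        }
        return list.add(element);
    }

    @Override
    public boolean contains(Object object) {
        return set.contains(object); // O(1) instead of O(n)
    }

    @Override
    public T remove(int index) {
        T removed = list.remove(index);
        set.remove(removed);
        return removed;
    }

    @Override
    public T get(int index) {
        return list.get(index);
    }

    @Override
    public int size() {
        return list.size();
    }
}
```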


So again, all this depends on what you want to do with these methods, and which interface you really need. If you don't need the indexed access of List, then it's easy. Otherwise, referring to your example:

At one point I compare it against another ArrayList, to find their intersection. I know this is O(n^2).

You can avoid this by creating the sets locally:

static <T> List<T> computeIntersection(List<T> list0, List<T> list1)
{
    Set<T> set0 = new LinkedHashSet<T>(list0);
    Set<T> set1 = new LinkedHashSet<T>(list1);
    set0.retainAll(set1);
    return new ArrayList<T>(set0);
}

This has an expected running time of O(n). Of course, if you do this frequently but rarely change the contents of the lists, there may be options to avoid the copies - but for the reason mentioned above, maintaining the required data structures may become tricky.
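For example, calling the method above with small made-up lists (note that duplicates in the inputs are collapsed, as discussed earlier):

```java
import java.util.*;

public class IntersectionDemo {
    // The intersection method from above, reproduced for a runnable example
    static <T> List<T> computeIntersection(List<T> list0, List<T> list1) {
        Set<T> set0 = new LinkedHashSet<T>(list0);
        Set<T> set1 = new LinkedHashSet<T>(list1);
        set0.retainAll(set1);
        return new ArrayList<T>(set0);
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2, 3, 4, 2);
        List<Integer> b = Arrays.asList(3, 4, 5);
        System.out.println(computeIntersection(a, b)); // [3, 4]
    }
}
```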

Marco13
  • thanks a ton, very informative! Bonus question: if I've got an ArrayList, and I call myArrayList.removeAll(myHashSet), does the operation take advantage of the O(1) lookup time of myHashSet? The reason I ask is because the method takes in a Collection, not a HashSet, and I'm wondering if that means it treats it as a more generic type that doesn't know about its quick lookup. – bob Jul 29 '15 at 20:12
  • @andy This may actually be tricky - see http://stackoverflow.com/questions/28671903 But for your case of removing from a list: It should take advantage of the O(1) `contains` method of `Set`. However, removing one element from an `ArrayList` is O(n) (if you remove the first element, all other elements have to be moved "one index to the left"). So there *might* be cases where it is beneficial to copy the data from the `List` into a `LinkedHashSet`, then do the `removeAll` call, and then copy the remaining elements back to a `List` - but this is only a guess; I did not do a benchmark for this! – Marco13 Jul 29 '15 at 20:32
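The round trip described in the last comment could be sketched like this (the method name is made up, whether it actually beats `List.removeAll` would need a benchmark, and note that it collapses duplicates in the input list):

```java
import java.util.*;

public class RemoveAllViaSetDemo {
    // Copy into a LinkedHashSet (preserving encounter order), do the
    // removeAll there with O(1) lookups, then copy back into a List.
    static <T> List<T> removeAllViaSet(List<T> list, Collection<T> toRemove) {
        Set<T> set = new LinkedHashSet<T>(list);
        set.removeAll(new HashSet<T>(toRemove));
        return new ArrayList<T>(set);
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(removeAllViaSet(list, Arrays.asList(2, 4))); // [1, 3, 5]
    }
}
```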