
I have a simple problem: find the first unique element in an array A. What bothers me is the time complexity of the different methods. I've tried these two methods so far.

First Method:

LinkedHashMap<Integer, List<Integer>> map = new LinkedHashMap<Integer, List<Integer>>();
for (int i = 0; i < A.length; i++)
{
    if (!map.containsKey(A[i]))
        map.put(A[i], new ArrayList<>());
    map.get(A[i]).add(i);
}

for (Map.Entry<Integer, List<Integer>> m : map.entrySet())
    if (m.getValue().size() == 1)
        return m.getKey();
return -1;

Second Method:

    for(int i=0; i< A.length; i++){
        boolean unique = true;
        nestedFor:for(int j=0; j< A.length; j++){
            if(i != j && A[i] == A[j]){
                unique = false;
                break nestedFor;
            }
        }
        if(unique)
            return A[i];
    }
    return -1;

Testing with an array of 1,000,000 elements, the first method runs in ~2000ms, while the second runs in ~10ms. My question is: shouldn't the first method execute faster, since its complexity is O(n log n) compared to the second method, whose complexity is O(n^2)? What am I missing here? The test code is below:

    int[] n = new int[1000000];
    for (int i = 0; i < n.length; i++)
        n[i] = new Random().nextInt(2000000);

    long start = System.currentTimeMillis();
    firstUnique(n);
    System.err.println("Finished at: " + (System.currentTimeMillis() - start ) + "ms");

EDIT:

for (int i = 0; i < A.length; i++)
{
    if (!map.containsKey(A[i]))
        map.put(A[i], new ArrayList<>());
    map.get(A[i]).add(i);
}

Consumes 99% of execution time, while

for (Map.Entry<Integer, List<Integer>> m : map.entrySet())
    if (m.getValue().size() == 1)
        return m.getKey();

is always 1-3ms. So yes, filling the map is the most expensive operation.

What would you suggest as the most efficient method for this kind of problem?

  • with the first method you're also measuring the creation of at least 2 million objects that run out of scope at the end of your function call - however your GC is handling that. if `A` in the second version is the `int[]` then you don't have this overhead there... – BeyelerStudios May 16 '16 at 20:19
  • Using System.currentTimeMillis() is not a good way to do a benchmark; see http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java for more information on writing a Java benchmark – CConard96 May 16 '16 at 20:19
  • You can make the second method roughly twice as fast (in the worst case) by iterating `for (int j = i + 1; ...)`. Not only do you iterate half the elements, you also skip the `i != j` check. – Andy Turner May 16 '16 at 20:36

6 Answers


I suspect that you are not picking inputs which create "worst case" conditions for the second method.

For instance, if you construct the array such that all of the million elements have a duplicate (e.g. A[i] = 2 * i / A.length), then the second method is way, way slower than the first, since it has to check 10^12 combinations of elements.
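
For reference, a minimal sketch of such an all-duplicates input (the variable name `worst` is just illustrative):

    // Every value occurs at least twice (the first half is all 0s, the second half all 1s),
    // so the nested-loop method never finds a unique element and, for many elements,
    // has to scan a large part of the array before it hits a duplicate.
    int[] worst = new int[1_000_000];
    for (int i = 0; i < worst.length; i++)
        worst[i] = 2 * i / worst.length;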

You can make it a bit faster (roughly twice as fast) by changing the condition on the inner for loop to only check from j = i + 1, but 10^12 / 2 is still a pretty big number.

If you are simply picking random numbers to populate the array, there is a reasonable chance that the first element is unique, a greater chance that at least one of the first two elements is unique, etc. After a few elements, it is nearly certain that one of them is unique, so the outer loop stops after a few iterations.


The 2 seconds taken for the first method is way too long. I can only think that you're not warming up your JIT correctly before the benchmark. But even without trying to do that, your first method only takes 40-50ms for me (dropping to 10-15ms after a few iterations).
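
For what it's worth, a rough warm-up sketch around the question's own harness (still not a proper benchmark; a JMH-style harness would be better):

    // Run the method a number of times before timing, so the JIT has a chance to compile it.
    for (int warmup = 0; warmup < 20; warmup++)
        firstUnique(n);

    long start = System.currentTimeMillis();
    firstUnique(n);
    System.err.println("Finished at: " + (System.currentTimeMillis() - start) + "ms");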

Most of this time will be due to object creation - both in the autoboxing of the keys and values, and in the creation of the ArrayList instances.

Andy Turner
  • Totally makes sense! I just changed n[i] = new Random().nextInt(10000) instead of .nextInt(2000000) and guess what - the second method finished at 34532ms while the first method remained unchanged (~2ms). Good point! – user3215799 May 16 '16 at 20:57
  • Also, how would you speed up the first method? – user3215799 May 16 '16 at 21:00
  • Not sure how much it would improve it, but I'd look simply to count the number of occurrences of the element, rather than storing the indices. You might also want to box `A[i]` once, and reuse that value, to avoid repeatedly creating objects. – Andy Turner May 16 '16 at 21:03
  • A way without creating any values might be to create an enum to serve as the map value, and use it like a per-key state machine: when the element isn't present, you add `UNIQUE` as the value; if `UNIQUE` is present, you set `DUPLICATED` as the value (and if `DUPLICATED` is present, you set `DUPLICATED` as the value). Then you wouldn't need to create any additional instances, since enum values are singletons. This could be quite elegant using `Map.compute` in Java 8. – Andy Turner May 16 '16 at 21:05
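
A rough sketch of that per-key state machine, assuming Java 8's Map.compute (the `State` enum and method name are illustrative, not from the original post):

    // Requires java.util.LinkedHashMap and java.util.Map.
    // State per key: UNIQUE after the first sighting, DUPLICATED after any further one.
    enum State { UNIQUE, DUPLICATED }

    static int firstUnique(int[] a) {
        // LinkedHashMap keeps insertion order, so the first UNIQUE entry is the answer.
        Map<Integer, State> states = new LinkedHashMap<>();
        for (int value : a)
            states.compute(value, (k, v) -> v == null ? State.UNIQUE : State.DUPLICATED);
        for (Map.Entry<Integer, State> e : states.entrySet())
            if (e.getValue() == State.UNIQUE)
                return e.getKey();
        return -1;
    }

No value objects are created beyond the boxed keys, since enum constants are singletons.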

Time complexity ignores coefficients because it is often more useful to know how a function grows with increasing input sizes. Although your first function has a lower time complexity, at small input sizes it will run much slower, because you are creating many ArrayList objects, which is computationally expensive. Your second method, however, only uses array accesses, which are much less expensive than instantiating objects.

Natecat

Time complexity is meant to be understood in its asymptotic sense (i.e. as input sizes grow to googolplex), and nothing else. If an algorithm has linear time complexity, that only means there are some constants a, b such that execution time is (roughly!) a * inputsize + b. It says nothing about the actual magnitude of a and b, and two linear algorithms can still have massive performance differences because the magnitudes of their a/b differ vastly.

(Also, your example is a poor one, because the time complexity of an algorithm should take into account the complexity of all the underlying operations, such as object creation and the like. Others have hinted at that in their answers too.)

Erwin Smout

Consider using 2 sets instead:

public int returnFirstUnique(int[] a)
{
  // values seen exactly once so far, in insertion order
  final LinkedHashSet<Integer> uniqueValues = new LinkedHashSet<Integer>(a.length);
  // values seen more than once
  final HashSet<Integer> dupValues = new HashSet<Integer>(a.length);

  for (int i : a)
  {
    final Integer obj = i;
    if (!dupValues.contains(obj))
    {
      // add() returns false if the value is already in the set -> it is a duplicate
      if (!uniqueValues.add(obj))
      {
        uniqueValues.remove(obj);
        dupValues.add(obj);
      }
    }
  }

  if (!uniqueValues.isEmpty())
  {
    return uniqueValues.iterator().next();
  }
  return -1;
}
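
For example (a hypothetical input, assuming the method above is in scope):

    int[] data = {5, 3, 5, 7, 3};
    System.out.println(returnFirstUnique(data)); // 5 and 3 repeat, so this prints 7
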
Brett Okken

First as to why the benchmark isn't relevant:

  • Even if we ignore inaccuracies caused by the measurement method, GC, etc., finding out that method 2 is faster on a million entries won't tell you anything about how it will perform on a billion entries
    • Big-O is a theoretical concept and has to be proved theoretically. The most a benchmark can do for you here is let you estimate the complexity, and that would be done not by comparing two methods on one input, but by comparing one method on multiple inputs, each an order of magnitude larger than the previous (and even then it's near impossible to draw any useful conclusions)
  • Big-O is a worst-case complexity, but your random input is likely to be somewhere "in the middle" for the first method (map), while it is far from the worst case for the array - it actually has roughly a 50% chance of succeeding on the first iteration, while the map has to be fully processed and will on average hold about half a million entries
    • the worst case for the "map" method would probably be all elements different but all falling into the same hash bucket (so you would need to read the whole list of already-added elements in each of the n iterations)
    • the worst case for the "array" method is an input with no unique element where each element's duplicate lies far away in the array (if all elements were equal, the inner loop would break almost immediately), so the nested iteration has to run nearly to completion for every element

As to finding a good algorithm - you could use a Map<Integer, Boolean> instead of a Map<Integer, List<Integer>>, since you only need to store a "unique" flag and not a list of indices - put True when you see the element for the first time, and switch it to False when you encounter a duplicate.
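
A minimal sketch of that idea (the method name is illustrative; requires java.util.LinkedHashMap and java.util.Map):

    static int firstUnique(int[] a) {
        // true = seen exactly once so far, false = duplicated
        Map<Integer, Boolean> seenOnce = new LinkedHashMap<>();
        for (int value : a)
            seenOnce.put(value, !seenOnce.containsKey(value));
        // insertion order is preserved, so the first entry still flagged true wins
        for (Map.Entry<Integer, Boolean> e : seenOnce.entrySet())
            if (e.getValue())
                return e.getKey();
        return -1;
    }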

  • LinkedHashMap operations put and containsKey/get have worst-case complexity O(n), making the whole algorithm O(n^2) in the worst case
  • However, the amortized complexity of put is O(1) (making the amortized complexity of all insertions O(n)), and the average complexity of get is constant (this depends on how well the hash function works for the given inputs); the final unique-value lookup is then O(n)
Jiri Tousek
  • So you are saying that the first method is also O(n^2) complex? – user3215799 May 16 '16 at 21:33
  • Also, sorting the array and then finding the first unique element is not a solution, since I need to find the first unique element (the one with the smallest index) – user3215799 May 16 '16 at 21:33
  • I updated my idea of the worst-case scenario for the map - see the post. Yes, in the worst case all elements are different but fall into the same bucket of the HashMap, so for every element you have to read all past elements. This is almost exactly the same as the second algorithm, which for each element has to read (after the optimization suggested in the other post) all future elements. – Jiri Tousek May 17 '16 at 07:00
  • As for sorting the array idea, you're right. I thought stable sorting algorithm would help you but that's obviously wrong. – Jiri Tousek May 17 '16 at 07:04

My observations: the second method is much quicker because it works on an array with a fixed, declared size. In the first example, the map keeps changing size as it grows.

Try giving the LinkedHashMap a more accurate size up front, i.e. set its initial capacity to 1000000.
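
For example (a minimal sketch; accounting for the default 0.75 load factor is an extra tweak on top of the answer's suggestion):

    // Pre-sizing avoids repeated resizing/rehashing while the map is being filled.
    Map<Integer, List<Integer>> map =
            new LinkedHashMap<>((int) (1_000_000 / 0.75f) + 1);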

The next thing is that an array is a much simpler structure that gives the GC nothing to do. A LinkedHashMap is more complex, and the cost of creating and manipulating it is in some cases far higher than simply getting an element at a particular index of an array.

RMachnik
  • I am not comparing `ArrayList` to `LinkedHashMap` -> I am comparing a simple array to `LinkedHashMap`. Agreed, inside the `LinkedHashMap` there are allocations of `ArrayList`s. – RMachnik May 16 '16 at 20:36