8

I've been fooling around with a bunch of different ways of searching collections, collections of collections, etc. Doing lots of stupid little tests to verify my understanding. Here is one which boggles me (source code further below).

In short, I am generating N random integers and adding them to a list. The list is NOT sorted. I then use Collections.contains() to look for a value in the list. I intentionally look for a value that I know won't be there, because I want to ensure that the entire list space is probed. I time this search.

I then do another linear search manually, iterating through each element of the list and checking if it matches my target. I also time this search.

On average, the second search takes 33% longer than the first one. By my logic, the first search must also be linear, because the list is unsorted. The only possibility I could think of (which I immediately discard) is that Java is making a sorted copy of my list just for the search, but (1) I did not authorize that usage of memory space and (2) I would think that would result in MUCH more significant time savings with such a large N.

So if both searches are linear, they should both take the same amount of time. Somehow the Collections class has optimized this search, but I can't figure out how. So... what am I missing?

import java.util.*;

public class ListSearch {

    public static void main(String[] args) {

        int N = 10000000; // number of ints to add to the list
        int high = 100; // upper limit for random int generation

        List<Integer> ints;
        int target = -1; // target will not be found, forces search of entire list space

        long start;
        long end;

        ints = new ArrayList<Integer>();
        start = System.currentTimeMillis();
        System.out.print("Generating new list... ");
        for (int i = 0; i < N; i++) {
            ints.add(((int) (Math.random() * high)) + 1);
        }
        end = System.currentTimeMillis();
        System.out.println("took "  + (end-start) + "ms.");
        start = System.currentTimeMillis();
        System.out.print("Searching list for target (method 1)... ");
        if (ints.contains(target)) {
            // nothing
        }
        end = System.currentTimeMillis();
        System.out.println(" Took "  + (end-start) + "ms.");

        System.out.println();

        ints = new ArrayList<Integer>();
        start = System.currentTimeMillis();
        System.out.print("Generating new list... ");
        for (int i = 0; i < N; i++) {
            ints.add(((int) (Math.random() * high)) + 1);
        }
        end = System.currentTimeMillis();
        System.out.println("took "  + (end-start) + "ms.");
        start = System.currentTimeMillis();
        System.out.print("Searching list for target (method 2)... ");
        for (Integer i : ints) {
            // nothing
        }
        end = System.currentTimeMillis();
        System.out.println(" Took "  + (end-start) + "ms.");
    }
}

EDIT: Below is a new version of this code. The interesting thing is that now my manual linear loop performs 16% faster than the contains method (NOTE: both are designed to intentionally search the entire list space, so I know they are equal in number of iterations). I can't account for this 16% gain... more confusion.

import java.util.*;

public class ListSearch {

    public static void main(String[] args) {

        int N = 10000000; // number of ints to add to the list
        int high = 100; // upper limit for random int generation

        List<Integer> ints;
        int target = -1; // target will not be found, forces search of entire list space

        long start;
        long end;

        ints = new ArrayList<Integer>();
        start = System.currentTimeMillis();
        System.out.print("Generating new list... ");
        for (int i = 0; i < N; i++) {
            ints.add(((int) (Math.random() * high)) + 1);
        }
        end = System.currentTimeMillis();
        System.out.println("took "  + (end-start) + "ms.");
        start = System.currentTimeMillis();
        System.out.print("Searching list for target (method 1)... ");
        if (ints.contains(target)) {
            System.out.println("hit");
        }
        end = System.currentTimeMillis();
        System.out.println(" Took "  + (end-start) + "ms.");

        System.out.println();

        ints = new ArrayList<Integer>();
        start = System.currentTimeMillis();
        System.out.print("Generating new list... ");
        for (int i = 0; i < N; i++) {
            ints.add(((int) (Math.random() * high)) + 1);
        }
        end = System.currentTimeMillis();
        System.out.println("took "  + (end-start) + "ms.");
        start = System.currentTimeMillis();
        System.out.print("Searching list for target (method 2)... ");
        for (int i = 0; i < N; i++) {
            if (ints.get(i) == target) {
                System.out.println("hit");
            }
        }
        end = System.currentTimeMillis();
        System.out.println(" Took "  + (end-start) + "ms.");
    }
}
The111
  • 5,757
  • 4
  • 39
  • 55
  • Do you realize that your second "search" is not even searching? It is just iterating the elements of the list ... – Stephen C Oct 20 '12 at 05:20
  • Yeah, I actually just realized that a few minutes ago and am playing with my code more now. Sorry about that. May have to run soon but will update this post later. – The111 Oct 20 '12 at 05:23

3 Answers3

6

Your comparison code is buggy, and this is distorting your results.

This does search for the target

    if (ints.contains(target)) {
        // nothing
    }

But this does not!

    for (Integer i : ints) {
        // nothing
    }

You are actually just iterating over the list elements without testing them.

Having said that, the second version is slower than the first version for one or more of the following reasons:

  • The first version will be iterating the backing array using a simple for loop and an index. The second version is equivalent to the following:

    Iterator<Integer> it = ints.iterator();
    while (it.hasNext()) {
        Integer i = (Integer) it.next();
    }
    

    In other words, each time around the loop involves 2 method calls and a typecast1.

  • The first version will return true as soon as it gets a match. Because of the bug in your implementation, the second version will iterate the entire list every time. In fact, given the choice of N and high, this effect is the most likely the main cause of the difference in performance.

1 - Actually, it is not entirely clear what the JIT compiler will do with all of this. It could in theory inline the method calls, deduce that the typcaset is unnecessary, or even optimize away the entire loop. On the other hand, there are factors that may inhibit these optimizations. For instance, ints is declared as List<Integer>which could inhibit inlining ... unless the JIT is able to deduce that the actual type is always the same.


Your results are possibly also distorted for another reason. Your code does not take account of JVM warmup. Read this Question for more details: How do I write a correct micro-benchmark in Java?

Community
  • 1
  • 1
Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • Thanks Stephen. Regarding your second bullet, you are correct that I was neglecting to actually search in the second case, but you may have missed the part where I said I am intentionally trying to iterate the _entire_ list every time. To further that end, I'm searching for a value that I know won't be found. I've updated my post to include a second program. In this one, the manual linear search (of the entire list) performs 16% _faster_ than the `contains` method (also searching the entire list). Can you explain this new finding? Thanks again. – The111 Oct 20 '12 at 06:35
  • 1
    One reason is that you are using `==` rather than `.equals()` to compare the values. That is unreliable ... it will break for large integers. And you still haven't fixed the JVM warmup issue which makes your results doubtful anyway. – Stephen C Oct 20 '12 at 07:31
  • I didn't show it here, but one thing I did do was wrap the entire generating/searching blocks in a big repeat loop to force each one to run several times. I did notice that after the first couple runs, the reported times stabilized and stopped changing so much. Not sure if that counts as JVM warmup or not. But, I just changed the `==` to `.equals()` and sure enough that slowed that search down a lot, now the `contains` version is outperforming again. – The111 Oct 20 '12 at 08:55
3

Here is the difference:

When you use contains, it use the internal array of the object and does a search like this:

    for (int i = 0; i < size; i++)
        if (searchObject.equals(listObject[i]))
            return true;
    return false;

Here when it tries to get the ith element, it directly gets the ith element object form the internal array.

When you write your own, you are writing like:

    for (Integer i : ints) {
        // nothing
    }

its equivalent to :

   for(Iterator<Integer> iter= ints.iterator(); iter.hasNext(); ) {
         Integer i = iter.next();
   }

Which is performing much more steps than the contains.

Yogendra Singh
  • 33,927
  • 6
  • 63
  • 73
  • Thanks, that sounds believable and would make sense. Where is it documented, if you don't mind my asking? – The111 Oct 20 '12 at 04:53
  • 2
    The111: Most of the time, I refer the source code itself :) You may want to open the source of any collection implementation e.g. `ArrayList` and check the methods. – Yogendra Singh Oct 20 '12 at 04:57
  • Embarrassed to admit I had never even considered that. Thanks. – The111 Oct 20 '12 at 05:01
  • 1
    It doesn't apply to his case, because he doesn't make any verifications in his for. – Daniel Pereira Oct 20 '12 at 05:11
  • 1
    +1 to your comment about just checking the source, but -1 to the answer. The loop expansion you proposed doesn't make sense, the enhanced for loop expands to an `Iterator` and it would be incredibly unwise to use `get` on an unknown data-structure to iterate it. – Tim Bender Oct 20 '12 at 05:17
  • @AmitD: Earlier I explained the linear search reading the subject line. I update the answer with corrected for loop expansion against the code now. If you think its good now, please reconsider the -ve vote. – Yogendra Singh Oct 20 '12 at 05:57
  • @TimBender: Earlier I explained the linear search reading the subject line. I update the answer with corrected for loop expansion against the code now. If you think its good now, please reconsider the -ve vote. – Yogendra Singh Oct 20 '12 at 05:57
1

So I'm not entirely sure you're testing anything. Javac (compiler) is smart enough to realize that you don't have any code inside your for loop and your if statement. In this case, Java will remove that code from it's compilation. The reason you might be getting time back is because you're actually counting the time it takes to print a string. System output time can vary drastically depending on what your system is doing. When writing timing tests, any I/O can create invalid tests.

First I would remove the string prints from inside your timing.

Second of all, ArrayList.contains is linear. It doesn't use the special for loop like you're doing. Your loop has some overhead of getting the iterator from the collection and then iterating over it. This is how the special for loop works behind the scenes.

Hope this helps.

Kurtymckurt
  • 337
  • 2
  • 6
  • Thanks. That is interesting info. But in this case I do not think the loops are being removed by the compiler, because in earlier tests I was doing a simple operation in there, and those tests were coming up with similar times. Also, just now, I commented out all the printing lines, and it still seems like the program takes the same time to run (granted, this is around half a second, so is hard to estimate... but if it were truly doing nothing it would be instantaneous and not a large fraction of a second). – The111 Oct 20 '12 at 05:04