3

Java 8

I have quite a large collection (about 100M elements) and I need to iterate through it and do some action. There are two alternatives:

  1. Iterate through it once and do the entire job (it will complicate the code significantly)

  2. Iterate through it twice and do half of the job during the first iteration and rest during the second one (this will simplify the code significatly)

So, I thought that iterating is not so expensive and wrote a simple example to measure it (I don't write benchmarks too often, therefore it may look a bit silly):

    Collection<Double> col = new LinkedList<>();
    for(int i = 0; i < 30000000; i++){
        col.add(Math.sqrt(i + 1));
    }
    long start1 = System.nanoTime();
    Double res = 0.0;
    for(Double d : col){
        res += d + d;
    }
    long end1 = System.nanoTime();
    System.out.println(end1 - start1);
    System.out.println("=================================");
    long start2 = System.nanoTime();
    Double res2 = 0.0;
    for(Double d : col){
        res2 += d;
    }
    for(Double d : col){
        res2 += d;
    }
    long end2 = System.nanoTime();
    System.out.println(end2 - start2);

The average result is the following:

1107881047 First

2133450162 Second (twice as slower)

So, iteration is quite a slow process. But I didn't understand why? I thought we do almost the same amount of work, so the performance would be that much different.

Remarkably, that if I use ArrayList instead of linked list, the result is:

3858616604 First

422297749 Second (Ten times faster than the first, twice as faster than in the example above).

Couldn't you explain in a nutshell this performance difference?

user3663882
  • 6,957
  • 10
  • 51
  • 92
  • 2
    There's a lot boxing/unboxing in your code, which is definitely contributing to the performance hit you've encountered. You should modify your code to avoid that if possible. – code_dredd Aug 05 '16 at 09:03
  • @ray Yes, I know. But the question is not in that point. I'm concerned about Array/LinkedList performance. – user3663882 Aug 05 '16 at 09:04
  • 3
    Some reading: http://stackoverflow.com/questions/322715/when-to-use-linkedlist-over-arraylist?rq=1 –  Aug 05 '16 at 09:06
  • 1
    "we do almost the same amount of work," The second test is doing twice the amount of work is slightly less than double the time so it is faster not slower. You would expect it to be a little faster as it has warmed up the LinkedList code. – Peter Lawrey Aug 05 '16 at 09:44

4 Answers4

6

First, for benchmarks you are much better using JMH tool

Now, to the original question. When you iterate over ArrayList you essentially do a sequential scan over an array, which is a single contiguous block of memory. CPU does a perfect job of prefetching that from main memory to CPU caches. Thus it is very fast.

In case of LinkedList you have to go from one element to another via an object reference. And in general case objects for each node can reside anywhere in the memory. So you have to constantly jump from one memory location to a completely unrelated one. CPU cannot predict that, cannot fetch data from the main memory. Thus you are constantly waiting for data.

Nikem
  • 5,716
  • 3
  • 32
  • 59
3

I think there're several reasons for the performance hit you're seeing.

  1. Auto-Boxing
  2. The Algorithm

Auto-Boxing

Every time you take a primitive type (e.g. int, float, etc) and put it into its class equivalent (i.e. Integer, Float, etc), the primitive value gets boxed-in. When you need to use it as a primitive, it needs to be un-boxed. The point here is that it takes CPU cycles and time to do this.

This SO post can provide more detail on auto-boxing.

The Algorithm

In simple terms, the algorithm you're using that iterates over the list only once is O(n), while the implementation you used that requires double iteration is now O(2n). Both are still O(n), but understand that Big-O is an asymptotic upper bound, not a measure of "performance". It should still be clear that iterating over the same list twice will take about twice as much work as iterating over the list only once.

The algorithm and auto-boxing combine for a noticeable effect, and you can see that in code like this:

for(int i = 0; i < reallyLargeNumberHere; i++){
    col.add(Math.sqrt(i + 1));
}

The sqrt calculation itself is expensive and the auto-boxing needs to be done to add each element. Still, the double-iteration over the large list is going to be the main culprit.

In Short

  1. You're iterating over a large set, and you do this twice in the 2nd implementation; you must pay the O(n) penalty twice now.
  2. You're suffering the hit caused by auto-boxing.
  3. The sqrt algorithm is itself slow.
  4. Linked list implementations are O(1) when adding/removing nodes, but this is not so for iterations/searches --they tend to be O(n)
  5. Linked lists are not friendly to the CPU cache because they're not allocated in contiguous memory blocks, like a primitive array would be.

Point #5 is about the CPU's ability to cache what it thinks you're about to use, but data being in different regions of memory makes this effort more expensive and less likely to improve performance as intended. (When reading about CPU pipelines and cache hit/misses, this SO post will be relevant.)

It's very important to know what your program will be doing most of the time, so that you can choose an appropriate data structure and algorithm to match. Poorly matched data structures and algorithms will get you easily avoidable performance problems.

This Big-O cheat-sheet might be helpful.

Lastly, while for your simple purposes the use of System.nanoTime is likely ok, you should consider using a tool that's been specifically designed to benchmark code if you want to do more serious stuff.

Community
  • 1
  • 1
code_dredd
  • 5,915
  • 1
  • 25
  • 53
1

There is alot of unboxing/boxing that is taking place, that does affect the performance a bit.

But apart from that you must know the difference between Arrays and Linked lists.

In Arrays, a contiguous block of memory is allocated for the array. Because of which Random access of elements are possible.

So when i want to get say 5000th element of a large array, it would be returning it in an instant.

Time Complexity

Accessing element = O(1)

But in linked lists, you cannot random access an element. For accessing an element you would need to traverse through the linked list until the required element comes.

So if I access 3rd element, internally it happens like 1->2->3, return 3rd element.

The first element does not have any information about the 3rd element, it only has a link to second element.

Time Complexity

Accessing element = O(n)

You need to understand the concepts of data structures and time complexity , when you make a decision what kind of collection to use. Because in large data sets, it may result in huge performance difference.

You can refer to this link for Time Complexities: http://bigocheatsheet.com/

Shaurya Chaudhuri
  • 3,772
  • 6
  • 28
  • 58
1

At First iteration, it takes O(n) and the second iteration takes O(2n) because what I see at your code is iterating the whole list twice instead of iterating two halfs.

//First:
for(Double d : col){
    res += d + d;
}

//Second
for(Double d : col){
    res2 += d;
}
for(Double d : col){
    res2 += d;
}

Note: Even if you modify your code to accomplish the half portions in two iterations, it will never be faster than the single iteration. It will be slower or equal.

Why Iteration of the LinkedList is slow?

You must prefer to use data structures according to your needs. For example LinkedList --> It is very fast for insertion and deletion but very slow for iteration.

Is ArraylList faster?

Yes it is faster and the fastest data structure is array. Because, for arrays, to fetch an item at index x, the addess of the element at index x is calculated as [[address of the array object]] + x * [[item_size_in_memory]] (maybe not this but closer).

What you can do?

1) You can update your code to use arrays instead of using LinkedLists. As I see you must iterate very huge size of elements so you must somehow gain speed to accomplish your task via using array or a data structure faster than LinkedLists.

2) Try out LinkedList iterator. I didn't try out but the iterator may be faster than the normal iteration. For example:

LinkedList<String> linkedList = new LinkedList<String>();
linkedList.add("eBay");
linkedList.add("Paypal");
linkedList.add("Google");
linkedList.add("Yahoo");
linkedList.add("IBM");
linkedList.add("Facebook");
// ListIterator approach
System.out.println("ListIterator Approach: ");
ListIterator<String> listIterator = linkedList.listIterator();
while (listIterator.hasNext()) {
    System.out.println(listIterator.next());
}

Note: I have taken the sample from this link.

Bahadir Tasdemir
  • 10,325
  • 4
  • 49
  • 61