59

I have a function which manipulates a very large list, exceeding about 250,000 items. For the majority of those items, it simply replaces the item at position x. However, for about 5% of them, it must remove them from the list.

Using a LinkedList seemed to be the most obvious solution to avoid expensive removals. However, naturally, accessing a LinkedList by index becomes increasingly slow as time goes on. The cost here is minutes (and a lot of them).

Using an Iterator over that LinkedList is also expensive, as I appear to need a separate copy to avoid Iterator concurrency issues while editing that list. The cost here is minutes.

However, here's where my mind is blown a bit. If I change to an ArrayList, it runs almost instantly.

For a list with 297515 elements, removing 11958 elements and modifying everything else takes 909ms. I verified that the resulting list is indeed 285557 in size, as expected, and contains the updated information I need.

Why is this so fast? I looked at the source for ArrayList in JDK6 and it appears to be using an arraycopy function as expected. I would love to understand why an ArrayList works so well here when common sense would seem to indicate that an array for this task is an awful idea, requiring shifting several hundred thousand items.

finnw
  • 47,861
  • 24
  • 143
  • 221
Ken
  • 1,075
  • 2
  • 11
  • 9
  • It's difficult to draw any meaningful conclusions here, as you haven't separated the run-time into time-for-access, time-for-modification, and time-for-removals. – Oliver Charlesworth May 23 '11 at 19:17
  • 1
    Maybe arraycopy is just really good? :) It's a very basic function that is used a lot for things you describe. I'd reckon it's very optimized. Also, you can use the `remove()` method of an `Iterator` to remove elements from a list while iterating. – musiKk May 23 '11 at 19:20
  • 1
    It would be interesting if you compare it with `LinkedList` . Why do you need copy of LinkedList? It seems to me that using iterator in your case simpler that iterating using indexes (or how do you iterate?). – Mikita Belahlazau May 23 '11 at 19:21
  • 7
    Array copy is a rather unexpensive operation and you are not yet in the range where this becomes really important. E.g. you copy approx 12000 times an array of size 150000 (on average). This does not take much time (tested here <500ms). – Howard May 23 '11 at 19:22
  • @Howard: You should make that an answer. – Oliver Charlesworth May 23 '11 at 19:29
  • @Oli Charlesworth: did as you told me ;-) – Howard May 23 '11 at 19:33
  • I suggest you profile your apps. – Kien Truong May 23 '11 at 19:38
  • @musiKk: The majority of items in the list need to be manipulated (strings, in particular), not just removed. I can't seem to find a way to do this using an iterator and strings directly. I suppose I could create a silly wrapper object that contains just the string, so I can take advantage of pass-by-reference. Does that seem like the best way? – Ken May 23 '11 at 20:39
  • Try making an array list of 10^6 elements, and removing them all with list.remove(0). That should make it feelable, even though arraycopy is fast. – Thomas Ahle May 23 '11 at 22:57
  • 1
    Oh, and if you want log(n) add, get and remove, you should consider a tree. – Thomas Ahle May 23 '11 at 22:59
  • 2
    @Nikita Beloglazov makes a good point - your quote about "I appear to need a separate copy to avoid Iterator concurrency issues while editing that list" shouldn't be so, unless you're actually dealing with multiple threads. If you all .remove() on the Iterator, this is perfectly safe; if you do .remove(n) on the List, then that's wrong. I suspect this is what you're doing and changing to Iterator.remove will make all your problems go away... – Cowan May 24 '11 at 00:18

5 Answers5

31

I ran a benchmark, trying each of the following strategies for filtering the list elements:

  • Copy the wanted elements into a new list
  • Use Iterator.remove() to remove the unwanted elements from an ArrayList
  • Use Iterator.remove() to remove the unwanted elements from a LinkedList
  • Compact the list in-place (moving the wanted elements to lower positions)
  • Remove by index (List.remove(int)) on an ArrayList
  • Remove by index (List.remove(int)) on a LinkedList

Each time I populated the list with 100000 random instances of Point and used a filter condition (based on the hash code) that would accept 95% of elements and reject the remaining 5% (the same proportion stated in the question, but with a smaller list because I didn't have time to run the test for 250000 elements.)

And the average times (on my old MacBook Pro: Core 2 Duo, 2.2GHz, 3Gb RAM) were:

CopyIntoNewListWithIterator   :      4.24ms
CopyIntoNewListWithoutIterator:      3.57ms
FilterLinkedListInPlace       :      4.21ms
RandomRemoveByIndex           :    312.50ms
SequentialRemoveByIndex       :  33632.28ms
ShiftDown                     :      3.75ms

So removing elements by index from a LinkedList was more than 300 times more expensive than removing them from an ArrayList, and probably somewhere between 6000-10000 times more expensive than the other methods (that avoid linear search and arraycopy)

Here there doesn't seem to be much difference between the four faster methods, but I ran just those four again with a 500000-element list with the following results:

CopyIntoNewListWithIterator   :     92.49ms
CopyIntoNewListWithoutIterator:     71.77ms
FilterLinkedListInPlace       :     15.73ms
ShiftDown                     :     11.86ms

I'm guessing that with the larger size cache memory becomes the limiting factor, so the cost of creating a second copy of the list becomes significant.

Here's the code:

import java.awt.Point;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

public class ListBenchmark {

    public static void main(String[] args) {
        Random rnd = new SecureRandom();
        Map<String, Long> timings = new TreeMap<String, Long>();
        for (int outerPass = 0; outerPass < 10; ++ outerPass) {
            List<FilterStrategy> strategies =
                Arrays.asList(new CopyIntoNewListWithIterator(),
                              new CopyIntoNewListWithoutIterator(),
                              new FilterLinkedListInPlace(),
                              new RandomRemoveByIndex(),
                              new SequentialRemoveByIndex(),
                              new ShiftDown());
            for (FilterStrategy strategy: strategies) {
                String strategyName = strategy.getClass().getSimpleName();
                for (int innerPass = 0; innerPass < 10; ++ innerPass) {
                    strategy.populate(rnd);
                    if (outerPass >= 5 && innerPass >= 5) {
                        Long totalTime = timings.get(strategyName);
                        if (totalTime == null) totalTime = 0L;
                        timings.put(strategyName, totalTime - System.currentTimeMillis());
                    }
                    Collection<Point> filtered = strategy.filter();
                    if (outerPass >= 5 && innerPass >= 5) {
                        Long totalTime = timings.get(strategyName);
                        timings.put(strategy.getClass().getSimpleName(), totalTime + System.currentTimeMillis());
                    }
                    CHECKSUM += filtered.hashCode();
                    System.err.printf("%-30s %d %d %d%n", strategy.getClass().getSimpleName(), outerPass, innerPass, filtered.size());
                    strategy.clear();
                }
            }
        }
        for (Map.Entry<String, Long> e: timings.entrySet()) {
            System.err.printf("%-30s: %9.2fms%n", e.getKey(), e.getValue() * (1.0/25.0));
        }
    }

    public static volatile int CHECKSUM = 0;

    static void populate(Collection<Point> dst, Random rnd) {
        for (int i = 0; i < INITIAL_SIZE; ++ i) {
            dst.add(new Point(rnd.nextInt(), rnd.nextInt()));
        }
    }

    static boolean wanted(Point p) {
        return p.hashCode() % 20 != 0;
    }

    static abstract class FilterStrategy {
        abstract void clear();
        abstract Collection<Point> filter();
        abstract void populate(Random rnd);
    }

    static final int INITIAL_SIZE = 100000;

    private static class CopyIntoNewListWithIterator extends FilterStrategy {
        public CopyIntoNewListWithIterator() {
            list = new ArrayList<Point>(INITIAL_SIZE);
        }
        @Override
        void clear() {
            list.clear();
        }
        @Override
        Collection<Point> filter() {
            ArrayList<Point> dst = new ArrayList<Point>(list.size());
            for (Point p: list) {
                if (wanted(p)) dst.add(p);
            }
            return dst;
        }
        @Override
        void populate(Random rnd) {
            ListBenchmark.populate(list, rnd);
        }
        private final ArrayList<Point> list;
    }

    private static class CopyIntoNewListWithoutIterator extends FilterStrategy {
        public CopyIntoNewListWithoutIterator() {
            list = new ArrayList<Point>(INITIAL_SIZE);
        }
        @Override
        void clear() {
            list.clear();
        }
        @Override
        Collection<Point> filter() {
            int inputSize = list.size();
            ArrayList<Point> dst = new ArrayList<Point>(inputSize);
            for (int i = 0; i < inputSize; ++ i) {
                Point p = list.get(i);
                if (wanted(p)) dst.add(p);
            }
            return dst;
        }
        @Override
        void populate(Random rnd) {
            ListBenchmark.populate(list, rnd);
        }
        private final ArrayList<Point> list;
    }

    private static class FilterLinkedListInPlace extends FilterStrategy {
        public String toString() {
            return getClass().getSimpleName();
        }
        FilterLinkedListInPlace() {
            list = new LinkedList<Point>();
        }
        @Override
        void clear() {
            list.clear();
        }
        @Override
        Collection<Point> filter() {
            for (Iterator<Point> it = list.iterator();
                 it.hasNext();
                 ) {
                Point p = it.next();
                if (! wanted(p)) it.remove();
            }
            return list;
        }
        @Override
        void populate(Random rnd) {
            ListBenchmark.populate(list, rnd);
        }
        private final LinkedList<Point> list;
    }

    private static class RandomRemoveByIndex extends FilterStrategy {
        public RandomRemoveByIndex() {
            list = new ArrayList<Point>(INITIAL_SIZE);
        }
        @Override
        void clear() {
            list.clear();
        }
        @Override
        Collection<Point> filter() {
            for (int i = 0; i < list.size();) {
                if (wanted(list.get(i))) {
                    ++ i;
                } else {
                    list.remove(i);
                }
            }
            return list;
        }
        @Override
        void populate(Random rnd) {
            ListBenchmark.populate(list, rnd);
        }
        private final ArrayList<Point> list;
    }

    private static class SequentialRemoveByIndex extends FilterStrategy {
        public SequentialRemoveByIndex() {
            list = new LinkedList<Point>();
        }
        @Override
        void clear() {
            list.clear();
        }
        @Override
        Collection<Point> filter() {
            for (int i = 0; i < list.size();) {
                if (wanted(list.get(i))) {
                    ++ i;
                } else {
                    list.remove(i);
                }
            }
            return list;
        }
        @Override
        void populate(Random rnd) {
            ListBenchmark.populate(list, rnd);
        }
        private final LinkedList<Point> list;
    }

    private static class ShiftDown extends FilterStrategy {
        public ShiftDown() {
            list = new ArrayList<Point>();
        }
        @Override
        void clear() {
            list.clear();
        }
        @Override
        Collection<Point> filter() {
            int inputSize = list.size();
            int outputSize = 0;
            for (int i = 0; i < inputSize; ++ i) {
                Point p = list.get(i);
                if (wanted(p)) {
                    list.set(outputSize++, p);
                }
            }
            list.subList(outputSize, inputSize).clear();
            return list;
        }
        @Override
        void populate(Random rnd) {
            ListBenchmark.populate(list, rnd);
        }
        private final ArrayList<Point> list;
    }

}
Dici
  • 25,226
  • 7
  • 41
  • 82
finnw
  • 47,861
  • 24
  • 143
  • 221
18

Array copy is a rather unexpensive operation. It is done on a very basic level (its a java native static method) and you are not yet in the range where the performance becomes really important.

In your example you copy approx 12000 times an array of size 150000 (on average). This does not take much time. I tested it here on my laptop and it took less than 500 ms.

Update I used the following code to measure on my laptop (Intel P8400)

import java.util.Random;

public class PerformanceArrayCopy {

    public static void main(String[] args) {

        int[] lengths = new int[] { 10000, 50000, 125000, 250000 };
        int[] loops = new int[] { 1000, 5000, 10000, 20000 };

        for (int length : lengths) {
            for (int loop : loops) {

                Object[] list1 = new Object[length];
                Object[] list2 = new Object[length];

                for (int k = 0; k < 100; k++) {
                    System.arraycopy(list1, 0, list2, 0, list1.length);
                }

                int[] len = new int[loop];
                int[] ofs = new int[loop];

                Random rnd = new Random();
                for (int k = 0; k < loop; k++) {
                    len[k] = rnd.nextInt(length);
                    ofs[k] = rnd.nextInt(length - len[k]);
                }

                long n = System.nanoTime();
                for (int k = 0; k < loop; k++) {
                    System.arraycopy(list1, ofs[k], list2, ofs[k], len[k]);
                }
                n = System.nanoTime() - n;
                System.out.print("length: " + length);
                System.out.print("\tloop: " + loop);
                System.out.print("\truntime [ms]: " + n / 1000000);
                System.out.println();
            }
        }
    }
}

Some results:

length: 10000   loop: 10000 runtime [ms]: 47
length: 50000   loop: 10000 runtime [ms]: 228
length: 125000  loop: 10000 runtime [ms]: 575
length: 250000  loop: 10000 runtime [ms]: 1198
Howard
  • 38,639
  • 9
  • 64
  • 83
  • +1: Would be useful if you could add the test code to your answer (assuming it's short), and some basic specs for your platform. – Oliver Charlesworth May 23 '11 at 19:35
  • @Oli Charlesworth: I'd like to but the system currently does no allow me to edit. Strange... – Howard May 23 '11 at 19:41
  • +1 for writing a simple benchmark while still avoiding the usual pitfalls. Although if he doesn't need random access an iterator over a linked list should be faster - that is if he uses the iterators remove which seems to be the real problem here. – Voo May 24 '11 at 10:55
10

I think the difference in performance is likely coming down to the difference that ArrayList supports random access where LinkedList does not.

If I want to get(1000) of an ArrayList I am specifying a specific index to access this, however LinkedList doesn't support this as it is organized through Node references.

If I call get(1000) of LinkedList, it will iterate the entire list until if finds index 1000 and this can be exorbitantly expensive if you have a large number of items in the LinkedList.

maple_shaft
  • 10,435
  • 6
  • 46
  • 74
6

Interesting and unexpected results. This is just a hypothesis, but...

On average one of your array element removals will require moving half of your list (everything after it) back one element. If each item is a 64-bit pointer to an object (8 bytes), then this means copying 125000 items x 8 Bytes per pointer = 1 MB.

A modern CPU can copy a contiguous block of 1 MB of RAM to RAM pretty quickly.

Compared to looping over a linked list for every access, which requires comparisons and branching and other CPU unfriendly activities, the RAM copy is fast.

You should really try benchmarking the various operations independently and see how efficient they are with various list implementations. Share your results here if you do!

dkamins
  • 21,450
  • 7
  • 55
  • 59
  • Perhaps I'm mistaken about the best way to handle this then. Because due to needing to edit the list itself, I did something like this to get a new list: List originalList = new LinkedList(); originalList = getAllItems(); List newList = new LinkedList(originalList); Creation of that second list was slower than all of those remove operations combined. – Ken May 23 '11 at 20:18
  • 2
    @Ken - I think you should make a new question specifically about how to copy lists, and include sample code. – dkamins May 23 '11 at 20:19
6

I'm skipping over some implementation details on purpose here, just to explain the fundamental difference.

To remove the N-th element of a list of M elements, the LinkedList implementation will navigate up to this element, then simply remove it and update the pointers of the N-1 and N+1 elements accordingly. This second operation is very simple, but it's getting up to this element that costs you time.

For an ArrayList however, the access time is instantaneous as it is backed by an array, meaning contiguous memory spaces. You can jump directly to the right memory address to perform, broadly speaking, the following:

  • reallocate a new array of M - 1 elements
  • put everything from 0 to N - 1 at index 0 in the new arraylist's array
  • put everything N + 1 to M at index N in the arraylist's array.

Thinking of it, you'll notice you can even reuse the same array as Java can use ArrayList with pre-allocated sizes, so if you remove elements you might as well skip steps 1 and 2 and directly do step 3 and update your size.

Memory accesses are fast, and copying a chunk of memory is probably sufficiently fast on modern hardware that moving to the N-position is too time consuming.

However, should you use your LinkedList in such a way that it allows you to remove multiple elements that follow each other and keep track of your position, you would see a gain.

But clearly, on a long list, doing a simple remove(i) will be costly.


To add a bilt of salt and spice to this:

  • See the note on Efficiency on the Array Data Structure and the note on Performance on the Dynamic Array Wikipedia entries, which describe your concern.
  • Keep in mind that using a memory structure that requires contiguous memory requires, well, contiguous memory. Which means your virtual memory will need to be able to allocate contiguous chunks. Or even with Java, you'll see your JVM happily going down with an obscure OutOfMemoryException taking its cause in a low-level crash.
haylem
  • 22,460
  • 3
  • 67
  • 96
  • the reallocation in the first step doesn't happen, arraylist can keep a shorter list than it's capacity (== length backing array) and System.arrayCopy allows overlapping copies – ratchet freak May 23 '11 at 20:18
  • @ratchet freak: which is exactly what I mention right after, but thanks for the added clarification if mine was a bit fuzzy. – haylem May 23 '11 at 21:29