11

I'm trying to remove 140,000 objects from an ArrayList of size 7,140,000. I expected this would take seconds (if that), but instead Java is taking multiple seconds per thousand objects. Here is my code:

     for (int i = list.size(); i > P; i--)
     {
         int size = list.size();

         int index = (int) (Math.random() * size);

         list.remove(index);
     }

Note: P is a constant that I previously set to 7,000,000.

The goal of the loop is to randomly remove objects from list until its size is 7,000,000.

Is Java taking such a long time because I'm starting off with over 7 million objects? I've never noticed this efficiency problem with removing from ArrayLists in the past. If it helps, I use the DrJava Beta IDE.

  • Does order of remaining items matter? – Piro Sep 26 '17 at 07:36
  • 1
    If the only criterion is to make the list of size 7,000,000, why do you need to do it randomly and pay the cost of moving the array? Why can't you just delete the elements from 7,140,000 to 7,000,000 ? – Nishit Oct 08 '17 at 03:11

2 Answers2

7

Every time you remove an element from an ArrayList, it has to shuffle all of the elements with greater indexes down by one slot. Say you remove the first element of a 7M-element list - you've then got to move 6,999,999 elements too.

If you're doing this in a loop, it will take O(n^2) time, where n is the size of the list. For a 7M-element list, that's going to be pretty slow.

Instead, if you know which elements you want to remove in advance, you can shift all the elements down in a single pass:

int dst = 0;
for (int src = 0; src < list.size(); ++src) {
  if (!toRemove(src)) {
    list.set(dst++, list.get(src));
  }
}
list.subList(dst, list.size()).clear();

where toRemove(src) is some function which says whether you want to remove the src-th element.

For example, you might construct a BitSet with all but P elements set:

BitSet toRemove = new BitSet(list.size());
for (int i = list.size(); i > P; i--) {
  int rand;
  do {
    rand = Math.random() * list.size();
  } while (toRemove.get(rand));
  toRemove.set(rand, true);
}

You still have to shift all of the 6,999,999 elements to the right if you just remove the zero-element from a 7M element list; but any other removals don't require any more shifts on top. This algorithm is O(n), where n is the size of the list.


Edit: you can pick P elements from the list (where P <= list.size()) like this:

int dst = 0;
Random rand = new Random();
for (int src = 0; dst < P; ++src) {
  if (rand.nextInt(list.size() - src) < (P-dst)) {
    list.set(dst++, list.get(src));
  }
}
list.subList(dst, list.size()).clear();

This strategy will select elements from the list with equal probability (*), and works well for any value of P; it also preserves the original order.


If you want to sample K items from a list with N items without drawing the same element twice, there are choose(N, K) = N! / (K! * (N-K)!) ways to do this. If you want to pick all elements from the list with equal probability, then you should pick any of these c(n,k) different configurations.

When there are k items left to pick from n items, you will either:

  • pick the first item; and then pick k-1 items from the remaining n-1 items; or
  • not pick the first item; and then pick k items from the remaining n-1 items.

In order to ensure the equal probability of picking the K elements overall, you need to choose one of the two options according to the number of combinations for picking from the n-1 elements:

                                   #(combinations after taking first item) 
P(take first item) = ------------------------------------------------------------------
                     #(combinations after taking) + #(combinations after not taking)

                   = C(n-1,k-1) / (C(n-1, k-1) + C(n-1, k))

                   = ... working omitted ...

                   = k / n

So, when you've got k items left to take from n, you should take the first item k/n of the time.

The two interesting cases to point out are:

  • When k == n, k/n = 1, so you always take the element. Intuitively, if you've got to pick n items out of n, you've got to take them all.
  • When k == 0, k/n = 0, so you never take the element. Intuitively, if you've already picked all K of your items, you don't need to take any more.

To implement this, you can simply generate a uniformly-distributed random number r in the range [0..n), and "take" the element from the list if r < k.

In terms of the implementation above, k = P - dst, and n = list.size() - src.

Andy Turner
  • 137,514
  • 11
  • 162
  • 243
  • I agree. `ArrayList` is an array **de-facto**. You have to use it only if you do not plan to remove it, only add element. In case you want to remove element from it, it is better to create **another** `ArrayList` and add there only required elements. As alternative, you could use `LinkedList`, which is more useful when you wan to modify list in the middle (add or remove elements in it). – Oleg Cherednik Sep 26 '17 at 07:19
  • 1
    Thanks for your answer. I did some of my own research just now, and it matches with the idea in your answer. I didn't realize how much time it took to move all the elements down on each pass, but it makes sense. My solution involved randomly setting 140,000 elements to null, and then putting all "non-null" elements in a temporary ArrayList. I then set list to the temporary ArrayList. Same idea as your solution I think, and it ran quickly as expected. – Inertial Ignorance Sep 26 '17 at 07:20
  • @oleg.cherednik ArrayList is ok to use in general use. If it's small or if it barely needs any operations on. Here it's just too much... – android developer Sep 26 '17 at 07:20
  • @oleg.cherednik Yes, I ended up doing what you mentioned. It worked fine. – Inertial Ignorance Sep 26 '17 at 07:21
  • @InertialIgnorance I've written some other possible solutions, if you wish. – android developer Sep 26 '17 at 07:21
  • @android developer Sure, I'd be interested in seeing some different algorithms to solving the problem. My solution seems to work fast enough though (removes the 140,000 objects in less than 1 second). – Inertial Ignorance Sep 26 '17 at 07:27
  • The "BitSet" solution seems to have a random efficiency issue, as it might go to the same bits that were set to true before. This has a higher chance to have a bad performance algorithm, the more the items you want to remove from the input (because you're setting more and more items to true, out of the entire list). For example, if you want to remove 99% of the items, you will get to a situtation that nearly 99% of the loop will fail to get which bit to set. I prefer and suggest a more deterministic solution instead – android developer Sep 26 '17 at 07:34
  • @android developer That's true. I suppose it only worked well for me because I was removing 140,000 / 7,140,000 objects: Under 2% – Inertial Ignorance Sep 26 '17 at 07:53
  • @InertialIgnorance "then putting all "non-null" elements in a temporary ArrayList" This isn't the same as my solution: it involves additional space for the temporary list. My solution does it in-place. – Andy Turner Sep 26 '17 at 08:00
  • @androiddeveloper "a random efficiency issue" I did say "for example". If you're removing 99% of the elements, it'd be easier to pick the 1% you aren't going to remove. This isn't an inherent problem with the approach. – Andy Turner Sep 26 '17 at 08:02
  • @Andy Turner I know, I meant the same general idea you mentioned. The implementation / solution is very different. – Inertial Ignorance Sep 26 '17 at 08:02
  • @oleg.cherednik "if you do not plan to remove it" removing elements from an `ArrayList` is fine - but only if you are removing them *at the end*. And it's *not* "better" to create another `ArrayList`: this requires additional storage space. Since it is very efficient to get and set elements by index in an `ArrayList`, doing it in-place can be preferable if you don't need the original list after. – Andy Turner Sep 26 '17 at 08:03
  • @AndyTurner The 99% was an example to illustrate the issue. Even 50% can be a bad thing. Theoretically, random algorithms such as this one can run forever (always choosing an already marked item, for example). In any case, I always prefer a deterministic solution, even if it can be a bit more complex. – android developer Sep 26 '17 at 08:18
  • @AndyTurner In order to fix your algorithm, instead of randomly choosing an item of the entire BitSet , you would have to randomly traverse it in linear way. The random steps should be based on how many items you've marked and how many are left to mark. – android developer Sep 26 '17 at 08:25
6

An ArrayList is backed by an array, so modifications need to really move items aside, and in some cases even create a whole new array.

Some possible solutions:

  1. Consider using LinkedList or skip-list implementation instead. Do note that here, to remove an item it still takes O(N) (or O(logN) in skip-list), because it has to find it. However, you can traverse the items with a chance based on how many items you've removed.

  2. You could randomly take items from the input into a new ArrayList till you get the number of items you wish. You have to know which items you added though, so traverse in a linear way, and have the random chooser to have a chance of how many steps to go, based on how many items you've moved.

  3. Easiest solution: shuffle the entire input array, and then choose the first M items.

Here's a possible code for solution #3:

public static List<String> pickNRandom(List<String> lst, int m) {
    Collections.shuffle(lst);
    return lst.subList(0, n);
}

The disadvantage here is that it ruins the order of the items. You can overcome this by creating a copy of the list as the input, but this will take more memory (temporarily) ...

android developer
  • 114,585
  • 152
  • 739
  • 1,270
  • Good idea... I never thought of solution #3. Very elegant. Is there a built in Java function I can use to shuffle the ArrayList? Or would I have to write my own function. In addition, couldn't shuffling all the elements in the ArrayList cause the same problem I was having (many elements constantly being shifted "down one") ? – Inertial Ignorance Sep 26 '17 at 07:55
  • @InertialIgnorance Actually there is a function for this: https://docs.oracle.com/javase/6/docs/api/java/util/Collections.html#shuffle(java.util.List) , but I never used it. It's quite easy to implement it. You can use a loop over each of the items of the list, and swap with a random item of the list. Solution 3 is the best I think, as you can create the new list just in the end, with the result items. The only possible problem is that you ruin the order of the input items. You can save the order temporarily, but this takes extra space... – android developer Sep 26 '17 at 08:17
  • Here's a sample answer I've found similar to solution #3 : https://stackoverflow.com/a/8378788/878126 – android developer Sep 26 '17 at 08:39
  • And here are other solutions, similar to solutions #1,#2 I've written: https://en.wikipedia.org/wiki/Reservoir_sampling – android developer Sep 26 '17 at 08:42
  • Makes sense. Running one loop through the list and swapping two random objects on each iteration should be very quick. In my case the order doesn't matter at all, so the solution is ideal. – Inertial Ignorance Sep 26 '17 at 09:12
  • @InertialIgnorance So you can try the 2-lines of codes I've shown. I wonder what's the best solution that doesn't ruin the order of items, yet is efficient in both memory and time. I never implemented the other solutions I wrote, as they require a lot more code and need thinking of making the randomness well distributed . – android developer Sep 26 '17 at 09:59
  • @androiddeveloper check out my edit to my answer. It doesn't require a lot of code: the probability with which you'd need to take an element in order to uniformly sample the list turns out to be trivial to calculate. – Andy Turner Sep 26 '17 at 23:48
  • @AndyTurner Yes I forgot about this solution. Thanks. – android developer Sep 27 '17 at 06:51