
How do I remove all occurrences of an item from a list? Say

A = [1, 2, 3, 8, 7, 8, 8, 7, 6]

and I want to remove every 8. How should I do this if the numbers are not in any particular order?

JS.
bugbytes
  • Are you sure a list is the data structure you want to use? You might want a set, or maybe a dict or something depending on what else your code is doing. – user2357112 Aug 01 '13 at 22:49
  • @user2357112: Considering that he wants to remove _all_ of the 8s, not all but one, and that he wants to still have two 7s, and in the original order… I don't think a set is what he wants. But you're right, it's hard to tell what he wants without more information about what he's trying to do. – abarnert Aug 01 '13 at 22:54
  • We don't know whether the original order needs to be maintained, or whether duplicates are important. If the important thing is whether or not the structure has an 8 in it, a set would allow fast transitions between "the set has an 8 in it" and "the set doesn't have an 8 in it", as well as fast checking of "does the set have an 8 in it". – user2357112 Aug 01 '13 at 23:09
  • @user2357112: Given that the input has a specific order, and has duplicates (and not just the 8s), I think it's a pretty good guess that these are relevant to the OP. – abarnert Aug 02 '13 at 19:20
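
If order and duplicates turn out not to matter for the OP, the set approach the commenters are describing is just this (a sketch, not from the original thread):

S = set(A)     # collapses duplicates and drops ordering
S.discard(8)   # removes 8 if present; no error if it isn't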

3 Answers


The easy way to do this is to build a new list, containing all of the items that aren't 8. For example:

B = [item for item in A if item != 8]

If you really want to do it in-place (for example, because some other object has a reference to the same list as A, and you want that other object to see the changes), you can. But it's trickier. For example, if you delete by indices, you will have to go backward at some point (because when you remove the first 8, all of the later 8s have new indices, so you always have to delete the last one first):

indices = [index for index, item in enumerate(A) if item == 8]
for index in reversed(indices):  # delete from the back so the remaining indices stay valid
    del A[index]
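
Equivalently, you can do it in one backward pass without building the index list first (a variant sketch, not from the original answer):

for index in reversed(range(len(A))):
    if A[index] == 8:  # deleting here only shifts elements we've already visited
        del A[index]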

Or you could just keep trying to remove until it fails, but this is both ugly and slow:

while True:
    try:
        A.remove(8)
    except ValueError:
        break

In fact, you're often better off still creating a new list, and then just mutating A into a copy of that new list:

A[:] = [item for item in A if item != 8]
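
The slice assignment is what makes the change visible through other references; a quick demonstration (the name B is illustrative):

B = A                                       # B is another name for the same list object
A[:] = [item for item in A if item != 8]    # mutate the existing list in place
print(B)                                    # prints [1, 2, 3, 7, 7, 6]: B sees the change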

For a performance test, I ran this code against all of the proposed answers. The algorithms tested are:

  • revind: My first in-place algorithm in this answer.
  • whiledel: Chris Barker's first algorithm.
  • slide: Chris Barker's second algorithm (the fixed version).
  • genexpr: My last algorithm, but using a genexpr instead of a listcomp (sketched just below this list).
  • listcomp: My last algorithm.
  • filterer: My last algorithm, but using filter (also sketched below; note that this means it builds a list in 2.x, but a genexpr-like iterator in 3.x).
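
For reference, the genexpr and filterer variants are one-liners along these lines (a sketch, not the exact test code):

A[:] = (item for item in A if item != 8)     # genexpr: CPython materializes the right side before assigning
A[:] = filter(lambda item: item != 8, A)     # filterer: a list in 2.x, a lazy iterator in 3.x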

Note that if you don't actually need to mutate A (and usually you don't; you can just rebind the name), the copy-then-mutate algorithms become even simpler and faster. But I didn't test that here.

I also didn't fully test the "while True: remove until error" algorithm, or Chris Barker's suggested variation in the comments, because they're obviously going to be so much slower that it's not worth testing. However, a very quick test shows that the variation is about 2x slower, as expected, and that both are orders of magnitude slower than anything else in almost any test case.

Anyway, the test is removing 0s from a random list of length 100K or 1M (at 1/10th as many reps), with values ranging from 0 up to 16, 256, or 65536 (the fewer distinct values, the higher the percentage of hits to remove).
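
The harness was roughly along these lines (a sketch with illustrative names, sizes, and rep counts, not the exact script):

import random
import timeit

def make_data(n, distinct):
    # values range from 0 up to `distinct`; fewer distinct values = more 0s to remove
    return [random.randrange(distinct) for _ in range(n)]

def listcomp(A, value=0):
    A[:] = [item for item in A if item != value]

data = make_data(100000, 256)
# the list() copy keeps each rep independent (its cost is included in the timing)
print(timeit.timeit(lambda: listcomp(list(data)), number=10))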

If you're using CPython, the listcomp version is always fastest, especially so when N is large. If you're using PyPy, the in-place algorithms can beat it when N is huge and M is tiny, but in that case, they're all very fast. The fancy slide algorithm eliminates the quadratic behavior of the other in-place algorithms, but it also slows things down for the simple cases, so there's really no place where it's a clear winner. (It's also by far the most complicated, which is probably why it was the only one that wasn't right on the first attempt.) If you're absolutely sure you're only going to be removing a tiny number of copies, and you're using PyPy, consider the whiledel solution; in any other use case, or when you don't know the use case for sure, I'd use the listcomp.

          64-bit CPython 3.3.0
          16 values    256 values   65536 values
          100K  1000K  100K  1000K  100K  1000K
revind    0.188 17.3   0.085 1.23   0.074 0.080
whiledel  0.324 19.3   0.206 1.36   0.199 0.203
slide     0.091  0.54  0.097 0.54   0.095 0.538
genexpr   0.094  0.11  0.100 0.11   0.099 0.108
listcomp  0.070  0.08  0.073 0.08   0.071 0.079
filterer  0.081  0.09  0.080 0.09   0.835 0.088

          64-bit CPython 2.7.2
          16 values    256 values   65536 values
          100K  1000K  100K  1000K  100K  1000K
revind    0.198 17.1   0.089 1.23   0.088 0.955
whiledel  0.345 19.8   0.233 1.36   0.234 0.243
slide     0.095  0.54  0.099 0.55   0.095 0.551
genexpr   0.092  0.11  0.097 0.11   0.107 0.116
listcomp  0.091  0.09  0.099 0.08   0.105 0.114
filterer  0.122  0.23  0.132 0.09   0.135 0.150

          64-bit PyPy 1.9.0 (Python 2.7.2)
          16 values    256 values   65536 values
          100K  1000K  100K  1000K  100K  1000K
revind    0.266 28.5   0.027 1.97   0.018 0.013
whiledel  0.281 30.2   0.023 1.94   0.034 0.009
slide     0.022  0.39  0.015 0.022  0.006 0.018
genexpr   0.089  0.13  0.087 0.154  0.089 0.147
listcomp  0.052  0.08  0.057 0.073  0.052 0.073
filterer  0.054  0.07  0.053 0.078  0.048 0.074

It's a bit hard to predict the performance of slide. In large-N/small-M cases, you'd expect it to blow whiledel away, but it's actually slower. But if you think about it: that algorithm effectively replaces M linear moves with N simple copies. While the former is O(NM) and the latter is O(N), the multiplier on looping in C (or, even better, inside memmove) is so much smaller than looping in Python that you can't ignore it unless N/M is huge (at which point all of the solutions are so fast that it scarcely matters). So, doing M Python loops and NM C loops can easily beat doing N Python loops. For a concrete case from the tables: in the 1000K/65536-values column, M is only about 15, so revind's handful of O(N) memmoves finishes in 0.080s while slide's million Python-level iterations take 0.538s.

abarnert
  • Or you could do `while 8 in A: A.remove(8)` – Chris Barker Aug 01 '13 at 23:58
  • @ChrisBarker: True. That's twice as slow as the `for` loop, but if you've already decided that quadratic performance instead of linear is good enough, I doubt you'll care about doubling the time, so… good point. – abarnert Aug 02 '13 at 00:21
  • Good point - see my answer for an in-place linear-time solution. – Chris Barker Aug 02 '13 at 18:03
  • @ChrisBarker: Your answer is exactly as linear (that is, not at all) as my `reversed(indices)` solution. Either way, you're doing N comparisons and M deletions; the fact that the `reversed` call also adds another M swaps doesn't change things; O(N+M) = O(N+2M). Except, of course, that the deletions themselves are not O(1), but O(N), so both solutions end up being O(NM). – abarnert Aug 02 '13 at 19:19
  • Ahh, you're right. I updated my answer with another stab at an in-place O(N) solution. I'd like to hear your thoughts about it – Chris Barker Aug 02 '13 at 19:55
  • Great answer and kudos for the benchmark of the answers. – DevLounge Aug 02 '13 at 22:04

Most of these answers suggest copying data. This can be undesirable if your list is sufficiently large. You can easily use

while 8 in A: A.remove(8)

to do it without copying any data. However, this runs in quadratic time, which is also undesirable if your list is large. To do it in linear time and without copying any data, use:

def remove_all_from_list(L, n):
    i = 0
    while i < len(L):
        if L[i] == n:
            del L[i] # Do not increment i here, because L[i] will change
        else:
            i += 1

>>> A = [1, 2, 3, 8, 7, 8, 8, 7, 6]
>>> remove_all_from_list(A, 8)
>>> A
[1, 2, 3, 7, 7, 6]

Edit: @abarnert reminded me that del L[i] is O(N) so this is actually quadratic. Here's another attempt at an O(N) in-place solution...

def remove_all_from_list(L, n):
    # indices takes *worst-case* O(N) space, but typical-case much less
    indices = [i for i, x in enumerate(L) if x == n]
    indices_seen = 0
    num_indices = len(indices)
    for i in xrange(len(L)):
        # skip over any removed slots we've already passed (handles consecutive n's)
        while (indices_seen < num_indices and
               i + indices_seen == indices[indices_seen]):
            indices_seen += 1
        if i + indices_seen >= len(L):
            break
        L[i] = L[i + indices_seen]  # slide each survivor left past the removed slots
    if indices_seen:  # guard: L[-0:] would clear the whole list when n never appears
        L[-indices_seen:] = []
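
With the guard in place, a quick check gives the same result as the first version:

>>> A = [1, 2, 3, 8, 7, 8, 8, 7, 6]
>>> remove_all_from_list(A, 8)
>>> A
[1, 2, 3, 7, 7, 6]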

This will do all the shuffling at the end, so each element gets moved at most once. I realize this will take exactly as much time as abarnert's copying method. I'm just trying to think of ways to reduce memory usage in case you have a very large list.


Final edit: Speed tests (not as comprehensive as @abarnert's)

import random
L = range(30)*10000    # Python 2: a list of 300,000 ints with values 0-29
random.shuffle(L)
from copy import copy

# remove_1 through remove_6 are the algorithms named in the output below
for fn in [remove_1, remove_2, remove_3, remove_4, remove_5, remove_6]:
    print fn.__name__
    %timeit fn(copy(L), 8)    # IPython %timeit; copy so each run gets a fresh list

remove_1 # listcomp
10 loops, best of 3: 39.1 ms per loop
remove_2 # revind
1 loops, best of 3: 1.7 s per loop
remove_3 # try: remove; except: break
1 loops, best of 3: 65.7 s per loop
remove_4 # while n in L: L.remove(n)
1 loops, best of 3: 129 s per loop
remove_5 # whiledel
1 loops, best of 3: 1.87 s per loop
remove_6 # slide
1 loops, best of 3: 227 ms per loop
Chris Barker
  • This isn't actually linear time, because each del from the middle of a list takes linear time (because the rest of the list has to be shuffled backward). Even ignoring that, if you time it, I'm willing to bet that it's actually slower than the reverse-indexing solutions, and that the same is true for equivalent C code. At any rate, the copying solutions _are_ linear, and a whole lot simpler to boot. – abarnert Aug 02 '13 at 19:18
  • I updated my answer. Since I'm creating a list of indices, I realize this could take up to O(N) space, but on a typical case it will be less. I don't think I'm missing any extra complexity here. Thoughts? – Chris Barker Aug 02 '13 at 19:58
  • The new version isn't actually correct. I haven't debugged it to see what's wrong, but running your first algorithm and all of mine against a list, I end up with 93578 values left, while your second one leaves 99992. Also, it seems to be still significantly slower than just building the copy and replacing `L[:]` in most, maybe every, case… but wait until I have complete test results before responding to that. – abarnert Aug 02 '13 at 20:39
  • I fixed the new version (it had problems with consecutive 8s). I'm running some speed tests now. – Chris Barker Aug 02 '13 at 21:10
  • OK, I've finished the tests. As expected, the `L[:] = [listcomp]` solution is usually fastest, sometimes by orders of magnitude. (It's linear, and its core loop is in C; it's going to be very hard to beat that.) – abarnert Aug 02 '13 at 21:20
  • Rerunning the tests with the fixed version: It does successfully remove the quadratic behavior… but it's significantly slower in all cases except the ones that are already ridiculously fast. – abarnert Aug 02 '13 at 21:31
  • One more point: While this solution means you only need O(M) temporary space instead of O(N), it also may mean you end up with O(N-M) _permanent_ wasted space, because (at least in some quick tests with CPython 3.3.0 and PyPy 1.9.0) Python never shrinks the list storage here. (Of course Python might not shrink the storage for my listcomp solution either… although it seems to do so. But if you don't really need to mutate `L`, which you usually don't, you'll just be keeping the copy and freeing the original, in which case that's not even an issue.) – abarnert Aug 02 '13 at 21:34
  • Thanks for doing all these tests - this is all very interesting. In practice I almost always use a generator expression, since it's the easiest to write, and now it looks like no matter how hard I try to optimize my solutions to this problem (in pure python), the simplest answer is best! – Chris Barker Aug 02 '13 at 21:49
  • That would be `O(M)`, not `O(N-M)`, since we're removing `M` elements. The new size is `N-M`, so the amount of wasted space is `N-(N-M) = M`. – Chris Barker Aug 02 '13 at 21:55
  • Right, it's O(M) wasted permanent storage, not O(N-M). Anyway, in practice, I rarely need to mutate `A` in the first place, so I'd use a genexpr or listcomp (depending on whether I need to iterate `B` once, or repeatedly) and just leave `A` as garbage. And `A = [...]` will clearly be faster than even `A[:] = [...]`. – abarnert Aug 02 '13 at 22:05
  • Given your results, I'm glad I didn't try running the two trivial quadratic-while loops. I think remove_4 would have taken longer than all of my test runs put together, including the time to write out the bash for loops to drive them and the code to format the results into a chart and paste it here… Anyway, I'm glad you wrote the "slide" code, because I would have just left it as "something you could do if you wanted, but I'm too lazy". :) – abarnert Aug 02 '13 at 22:08
In [13]: A= [1, 2, 3, 8, 7, 8, 8, 7, 6]

In [14]: [i for i in A if i!=8]
Out[14]: [1, 2, 3, 7, 7, 6]

In [15]: filter(lambda i: i!=8, A)
Out[15]: [1, 2, 3, 7, 7, 6]
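
Note that this is a 2.x session; in 3.x, filter returns a lazy iterator instead of a list, so to get the same output you would wrap it:

list(filter(lambda i: i != 8, A))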
inspectorG4dget
  • If you're going to use `filter`, you could just pass either `(8).__ne__` or `partial(ne, 8)` instead of a `lambda`. I'm not sure whether it's more readable, but it's certainly more "functional". :) – abarnert Aug 01 '13 at 22:53
  • One quick note that I'd forgotten about: In 2.x, there is no `int.__ne__`; it relies on the fallback to `int.__cmp__`. However, `int.__cmp__` happens to be truthy iff !=, so, while it's a bit hacky, you can use `(8).__cmp__` here. But, from a quick test, it isn't much faster than the lambda or partial, and I think the extra readability cost of having to understand the old-style comparison makes it a much worse idea than the 3.x equivalent. – abarnert Aug 02 '13 at 21:48
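
For the curious, the 2.x hack described in that last comment looks like this (Python 2 only; a sketch, since `int.__cmp__` does not exist in 3.x):

filter((8).__cmp__, A)   # (8).__cmp__(x) is 0 (falsy) only when x == 8, so all 8s are dropped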