17

This is a follow up to a similar question which asked the best way to write

for item in somelist:
    if determine(item):
         code_to_remove_item

and it seems the consensus was on something like

somelist[:] = [x for x in somelist if not determine(x)]

However, I think if you are only removing a few items, most of the items are being copied into the same object, and perhaps that is slow. In an answer to another related question, someone suggests:

for item in reversed(somelist):
    if determine(item):
        somelist.remove(item)

However, here the list.remove will search for the item, which is O(N) in the length of the list. May be we are limited in that the list is represented as an array, rather than a linked list, so removing items will need to move everything after it. However, it is suggested here that collections.dequeue is represented as a doubly linked list. It should then be possible to remove in O(1) while iterating. How would we actually accomplish this?

Update: I did some time testing as well, with the following code:

import timeit
setup = """
import random
random.seed(1)
b = [(random.random(),random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
        return (x[1]>.45) and (x[1]<.5)
"""
listcomp = """
c[:] = [x for x in b if tokeep(x)]
"""
filt = """
c = filter(tokeep, b)
"""
print "list comp = ", timeit.timeit(listcomp,setup, number = 10000)
print "filtering = ", timeit.timeit(filt,setup, number = 10000)

and got:

list comp =  4.01255393028
filtering =  3.59962391853
Community
  • 1
  • 1
highBandWidth
  • 16,751
  • 20
  • 84
  • 131
  • 5
    Without a specific performance objective, and without this being the only bottleneck in your program, this is a fair amount of hand-wringing over very little. What are the `timeit` measurements for these operations? – S.Lott Apr 21 '11 at 15:19
  • I retested with your code using `timeit` instead of `cProfile`. I was able to replicate your results. `filter` worked fastest. I need to dig deeper and figure out why cProfile showed filter was fastest. I deleted my answer because based on the differences between cProfile vs. timeit will only add to the confusion for future knowledge-seekers ;) The only conclusion that can be drawn with certainty is that `list.remove` sucks. – Praveen Gollakota Apr 21 '11 at 17:01
  • 1
    If you want to compare apples to apples, you should make the list comp use "c = ..." or make the filter use "c[:] = ...". – Daniel Stutzbach Apr 22 '11 at 04:33
  • @Daniel Stutzbach: `c = [x for x in b if tokeep(x)]` takes the same time that is strange. – jfs Apr 24 '11 at 00:08
  • I ran this same test in Python 3.4.1 and got the following result:list comp = 3.207311153242804 filtering = 0.0023201516740933847 – nakedfanatic Feb 09 '15 at 02:16
  • 1
    @nakedfanatic: `filter()` returns an iterator on Python 3. Call list() on the result, to get items. – jfs Oct 03 '15 at 00:56

7 Answers7

25

The list comprehension is the asymptotically optimal solution:

somelist = [x for x in somelist if not determine(x)]

It only makes one pass over the list, so runs in O(n) time. Since you need to call determine() on each object, any algorithm will require at least O(n) operations. The list comprehension does have to do some copying, but it's only copying references to the objects not copying the objects themselves.

Removing items from a list in Python is O(n), so anything with a remove, pop, or del inside the loop will be O(n**2).

Also, in CPython list comprehensions are faster than for loops.

Daniel Stutzbach
  • 74,198
  • 17
  • 88
  • 77
  • You're right this is asymptotically the fastest. I still wonder if there is a way to remove the linear overhead of copying. – highBandWidth Apr 21 '11 at 15:17
  • 3
    @highBandWidth: The copy is cache-friendly and will be very, very fast. You'd be better off trying to make sure determine() is fast. – Daniel Stutzbach Apr 21 '11 at 15:22
  • I am so confused by this code... why isn't it: somelist = [x for x in somelist if not determine(x)] – Garrett Berg Apr 23 '11 at 04:48
  • @Garrett Berg: Indeed. `somelist[:] =` triggers [`list_ass_slice()`](http://hg.python.org/cpython/file/020ebe0be33e/Objects/listobject.c#l573) that does a lot of work. – jfs Apr 23 '11 at 23:56
  • It is sort of annoying why the default `remove` or `pop` isn'tm implemented this way. – krenerd Apr 22 '21 at 07:47
  • @krenerd The default `remove` and `pop` mutate the list, whereas a list comprehension creates a new one. – Elias Hasle May 05 '21 at 09:57
3

If you need to remove item in O(1) you can use HashMaps

Kevin
  • 552
  • 2
  • 8
  • 17
  • You're right, and I might have to fall back on that, but hashmaps are overkill since I don't need O(1) removing for random items, just O(1) removal while iterating over that item, which is possible for linked lists. – highBandWidth Apr 21 '11 at 15:08
  • Can you point to next item? for example if you have someting like this: item1 , item2 , item3 and you have to remove item2 you can do something like this: item1->next=item3. I hope that now I understood correct – Kevin Apr 21 '11 at 15:29
3

Since list.remove is equivalent to del list[list.index(x)], you could do:

for idx, item in enumerate(somelist):
    if determine(item):
        del somelist[idx]

But: you should not modify the list while iterating over it. It will bite you, sooner or later. Use filter or list comprehension first, and optimise later.

Cat Plus Plus
  • 125,936
  • 27
  • 200
  • 224
  • That's what the second code snippet I mentioned does, with `reversed` solving the problem of safety. But remove or del somelist[idx] is still O(N), isn't it? – highBandWidth Apr 21 '11 at 15:12
  • @highBandWidth: `reversed` does not solve the "safety" problem. Nothing does except making an actual copy. – S.Lott Apr 21 '11 at 15:18
  • @S.Lott -- doesn't it though? I would have thought it would ensure that all `del` side-effects occur to portions of the list that have already been traversed. (Of course one still shouldn't do it that way -- scales badly.) – senderle Apr 21 '11 at 15:32
  • @senderle: (1) Try it. (2) Try to avoid assuming this kind of thing. Measurements are more valuable and useful that "would have thought". – S.Lott Apr 21 '11 at 15:46
  • @S.Lott, I completely agree with you, so I'll rephrase: I have tried it before and it seemed to work. Perhaps there was a test-case that I missed. – senderle Apr 21 '11 at 16:00
3

A deque is optimized for head and tail removal, not for arbitrary removal in the middle. The removal itself is fast, but you still have to traverse the list to the removal point. If you're iterating through the entire length, then the only difference between filtering a deque and filtering a list (using filter or a comprehension) is the overhead of copying, which at worst is a constant multiple; it's still a O(n) operation. Also, note that the objects in the list aren't being copied -- just the references to them. So it's not that much overhead.

It's possible that you could avoid copying like so, but I have no particular reason to believe this is faster than a straightforward list comprehension -- it's probably not:

write_i = 0
for read_i in range(len(L)):
    L[write_i] = L[read_i]
    if L[read_i] not in ['a', 'c']:
         write_i += 1
del L[write_i:]
senderle
  • 145,869
  • 36
  • 209
  • 233
2

I took a stab at this. My solution is slower, but requires less memory overhead (i.e. doesn't create a new array). It might even be faster in some circumstances!

This code has been edited since its first posting

I had problems with timeit, I might be doing this wrong.

import timeit
setup = """

import random
random.seed(1)
global b
setup_b = [(random.random(), random.random()) for i in xrange(1000)]
c = []
def tokeep(x):
        return (x[1]>.45) and (x[1]<.5)


# define and call to turn into psyco bytecode (if using psyco)
b = setup_b[:]
def listcomp():
   c[:] = [x for x in b if tokeep(x)]
listcomp()

b = setup_b[:]
def filt():
   c = filter(tokeep, b)
filt()

b = setup_b[:]
def forfilt():
   marked = (i for i, x in enumerate(b) if tokeep(x))
   shift = 0
   for n in marked:
      del b[n - shift]
      shift += 1
forfilt()

b = setup_b[:]
def forfiltCheating():
   marked = (i for i, x in enumerate(b) if (x[1] > .45) and (x[1] < .5))

   shift = 0
   for n in marked:
      del b[n - shift]
      shift += 1
forfiltCheating()

"""

listcomp = """
b = setup_b[:]

listcomp()
"""

filt = """
b = setup_b[:]

filt()
"""

forfilt = """
b = setup_b[:]

forfilt()
"""

forfiltCheating = '''
b = setup_b[:]

forfiltCheating()
'''

psycosetup = '''

import psyco
psyco.full()


'''

print "list comp = ", timeit.timeit(listcomp, setup, number = 10000)
print "filtering = ", timeit.timeit(filt, setup, number = 10000)
print 'forfilter = ', timeit.timeit(forfilt, setup, number = 10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, setup, number = 10000)


print '\nnow with psyco \n'
print "list comp = ", timeit.timeit(listcomp, psycosetup + setup, number = 10000)
print "filtering = ", timeit.timeit(filt, psycosetup + setup, number = 10000)
print 'forfilter = ', timeit.timeit(forfilt, psycosetup + setup, number = 10000)
print 'forfiltCheating = ', timeit.timeit(forfiltCheating, psycosetup + setup, number = 10000)

And here are the results

list comp =  6.56407690048
filtering =  5.64738512039
forfilter =  7.31555104256
forfiltCheating =  4.8994679451

now with psyco 

list comp =  8.0485959053
filtering =  7.79016900063
forfilter =  9.00477004051
forfiltCheating =  4.90830993652

I must be doing something wrong with psyco, because it is actually running slower.

Garrett Berg
  • 2,585
  • 1
  • 22
  • 21
  • You are doing the setup steps (making the b array), all over again, and timing it. I was only timing the actual filter/delete steps. What is the difference between forfilter and forfiltCheating? – highBandWidth Apr 21 '11 at 21:18
  • Ohh I get it, in *Cheating, you use the literal rather than the function object, and it seems to be faster! – highBandWidth Apr 21 '11 at 21:20
  • Apparently the timeit module does not re-run the initialization steps. Since I am changing the array itself, it has be be included in the timing. I've never used timit before, I was kind of just trying to get some results. – Garrett Berg Apr 23 '11 at 02:51
0

elements are not copied by list comprehension

this took me a while to figure out. See the example code below, to experiment yourself with different approaches

code

You can specify how long a list element takes to copy and how long it takes to evaluate. The time to copy is irrelevant for list comprehension, as it turned out.

import time
import timeit
import numpy as np

def ObjectFactory(time_eval, time_copy):
    """
    Creates a class

    Parameters
    ----------
    time_eval : float
        time to evaluate (True or False, i.e. keep in list or not) an object
    time_copy : float
        time to (shallow-) copy an object. Used by list comprehension.

    Returns
    -------
    New class with defined copy-evaluate performance

    """
    class Object:

        def __init__(self, id_, keep):
            self.id_ = id_
            self._keep = keep

        def __repr__(self):
            return f"Object({self.id_}, {self.keep})"

        @property
        def keep(self):
            time.sleep(time_eval)
            return self._keep

        def __copy__(self):  # list comprehension does not copy the object
            time.sleep(time_copy)
            return self.__class__(self.id_, self._keep)

    return Object

def remove_items_from_list_list_comprehension(lst):
    return [el for el in lst if el.keep]

def remove_items_from_list_new_list(lst):
    new_list = []
    for el in lst:
        if el.keep:
            new_list += [el]
    return new_list

def remove_items_from_list_new_list_by_ind(lst):
    new_list_inds = []
    for ee in range(len(lst)):
        if lst[ee].keep:
            new_list_inds += [ee]
    return [lst[ee] for ee in new_list_inds]

def remove_items_from_list_del_elements(lst):
    """WARNING: Modifies lst"""
    new_list_inds = []
    for ee in range(len(lst)):
        if lst[ee].keep:
            new_list_inds += [ee]
    for ind in new_list_inds[::-1]:
        if not lst[ind].keep:
            del lst[ind]

if __name__ == "__main__":
    ClassSlowCopy = ObjectFactory(time_eval=0, time_copy=0.1)
    ClassSlowEval = ObjectFactory(time_eval=1e-8, time_copy=0)

    keep_ratio = .8
    n_runs_timeit = int(1e2)
    n_elements_list = int(1e2)

    lsts_to_tests = dict(
        list_slow_copy_remove_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_copy_keep_many = [ClassSlowCopy(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_eval_remove_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
        list_slow_eval_keep_many = [ClassSlowEval(ii, np.random.rand() > keep_ratio) for ii in range(n_elements_list)],
    )

    for lbl, lst in lsts_to_tests.items():
        print()
        for fct in [
            remove_items_from_list_list_comprehension,
            remove_items_from_list_new_list,
            remove_items_from_list_new_list_by_ind,
            remove_items_from_list_del_elements,
        ]:

            lst_loc = lst.copy()
            t = timeit.timeit(lambda: fct(lst_loc), number=n_runs_timeit)
            print(f"{fct.__name__}, {lbl}: {t=}")


output

remove_items_from_list_list_comprehension, list_slow_copy_remove_many: t=0.0064229519994114526
remove_items_from_list_new_list, list_slow_copy_remove_many: t=0.006507338999654166
remove_items_from_list_new_list_by_ind, list_slow_copy_remove_many: t=0.006562008995388169
remove_items_from_list_del_elements, list_slow_copy_remove_many: t=0.0076057760015828535

remove_items_from_list_list_comprehension, list_slow_copy_keep_many: t=0.006243691001145635
remove_items_from_list_new_list, list_slow_copy_keep_many: t=0.007145451003452763
remove_items_from_list_new_list_by_ind, list_slow_copy_keep_many: t=0.007032064997474663
remove_items_from_list_del_elements, list_slow_copy_keep_many: t=0.007690364996960852

remove_items_from_list_list_comprehension, list_slow_eval_remove_many: t=1.2495998149970546
remove_items_from_list_new_list, list_slow_eval_remove_many: t=1.1657221479981672
remove_items_from_list_new_list_by_ind, list_slow_eval_remove_many: t=1.2621939050004585
remove_items_from_list_del_elements, list_slow_eval_remove_many: t=1.4632593330024974

remove_items_from_list_list_comprehension, list_slow_eval_keep_many: t=1.1344162709938246
remove_items_from_list_new_list, list_slow_eval_keep_many: t=1.1323430630000075
remove_items_from_list_new_list_by_ind, list_slow_eval_keep_many: t=1.1354237199993804
remove_items_from_list_del_elements, list_slow_eval_keep_many: t=1.3084568729973398
Markus Dutschke
  • 9,341
  • 4
  • 63
  • 58
-1
 import collections
 list1=collections.deque(list1)
 for i in list2:
   try:
     list1.remove(i)
   except:
     pass

INSTEAD OF CHECKING IF ELEMENT IS THERE. USING TRY EXCEPT. I GUESS THIS FASTER

Prabhas
  • 1
  • 3