11

I have several instances in a python script (v2.6) where I need to modify a list in-place. I need to pop values from the list in response to interactive input from the user and would like to know the cleanest method of doing this. Currently I have the very dirty solutions of a) setting items in the list that I want to remove to False and removing them with a filter or list comprehension or b) generating an entirely new list while going through the loop, which seems to be needlessly adding variables to the namespace and taking up memory.

An example of this problem is as follows:

for i, folder in enumerate(to_run_folders):
    if get_size(folder) < byte_threshold:
        ans = raw_input(('The folder {0}/ is less than {1}MB.' + \
                    ' Would you like to exclude it from' + \
                    ' compression? ').format(folder, megabyte_threshold))
        if 'y' in ans.strip().lower():
            to_run_folders.pop(i)

I would like to look at each folder in the list. If the current folder is less than a certain size, I want to ask the user if they want to exclude it. If they do, pop the folder from the list.

The problem with this routine is that if I iterate over the list, I get unexpected behavior and early termination. If I iterate over a copy by slicing, pop doesn't pull the right value because the indices are shifted and the problem compounds as more items are popped. I have a need for dynamic list adjustment of this kind in other areas of my script as well. Is there any clean method for this kind of functionality?

Patrick
  • 303
  • 3
  • 8
  • Stick with your *dirty* solution. It is clearly the fastest one, and the less problematic. – pepr Apr 24 '12 at 22:58

5 Answers5

9

You can loop over the list backwards, or use a view object.

See https://stackoverflow.com/a/181062/711085 for how to loop over a list backwards. Basically use reversed(yourList) (which happens creates a view object which visits backwards).

If you require indexing, you could do reversed(enumerate(yourList)), but that would effectively create a temporary list in memory because enumerate would need to run before reversed could kick in. You will need to either do index manipulation, or to do this:

for i in xrange(len(yourList)-1, -1, -1):
    item = yourList[i]
    ...

Even cleaner: reversed is aware of range, so you can do this in python3, or in python2 if you use xrange instead:

for i in reversed(range(len(yourList))):  
    item = yourList[i]
    ...

(proof: you can do next(reversed(range(10**10))), but this will crash your computer if using python2)

Community
  • 1
  • 1
ninjagecko
  • 88,546
  • 24
  • 137
  • 145
4

You can loop over it backwards

Backwards:

x = range(10)
l = len(x)-1  # max index

for i, v in enumerate(reversed(x)):
    if v % 2:
        x.pop(l-i)  # l-1 is the forward index
jdi
  • 90,542
  • 19
  • 167
  • 203
  • It should be noted that, in spite of my answer that `reversed(enumerate(yourList))` will make a copy of the list, this solution with `enumerate(reversed(x))` does work efficiently without making a copy of the list. You can do `i = len(x)-1-i` as the first line of the for loop to fix the indexing for extra readability. – ninjagecko Apr 24 '12 at 21:07
  • @ninjagecko: I was actually about to comment on your answer about that. I wasn't sure that either of them make copies since they both return generators on the given iterable argument. Am I wrong? Doesn't one just yield from the other either way? – jdi Apr 24 '12 at 21:10
  • You are welcome to try typing in `next(reversed(enumerate(10**8)))` into a prompt and hope that you don't use up all your computer's memory. =) As I said in my answer, `reversed` must wait until `enumerate` has seen (and therefore cached) the entire list before it can return the last tuple. – ninjagecko Apr 24 '12 at 21:37
  • @ninjagecko: `reversed(enumerate(yourList))` is not even possible. It raises a TypeError complaining that `reversed()` must be given a sequence. So, apparently you can't even do it that way to begin with :-) – jdi Apr 24 '12 at 22:06
  • Ah, indeed. It is however quite possible to do so in python3; I guess the python2 equivalent would be using `list` to turn the result of `enumerate(...)` into a sequence. – ninjagecko Apr 24 '12 at 22:20
  • @ninjagecko: tried it in python3.2 and it doesnt work there either. – jdi Apr 24 '12 at 23:59
2

OK, I have measured the solution. The reversed solutions are about the same. The forward while loop is about 4 times slower. BUT! The Patrik's dirty solution is about 80 times faster for the list of 100,000 random integers [bug in Patrik2 corrected]:

import timeit
import random

def solution_ninjagecko1(lst):
    for i in xrange(len(lst)-1, -1, -1):
        if lst[i] % 2 != 0:    # simulation of the choice
            del lst[i]
    return lst

def solution_jdi(lst):
    L = len(lst) - 1
    for i, v in enumerate(reversed(lst)):
        if v % 2 != 0:
            lst.pop(L-i)  # L-1 is the forward index
    return lst

def solution_Patrik(lst):
    for i, v in enumerate(lst):
        if v % 2 != 0:         # simulation of the choice
            lst[i] = None
    return [v for v in lst if v is not None]

def solution_Patrik2(lst):
    ##buggy lst = [v for v in lst if v % 2 != 0]
    ##buggy return [v for v in lst if v is not None]
    # ... corrected to
    return [v for v in lst if v % 2 != 0]

def solution_pepr(lst):
    i = 0                      # indexing the processed item
    n = 0                      # enumerating the original position
    while i < len(lst):
        if lst[i] % 2 != 0:    # simulation of the choice
            del lst[i]         # i unchanged if item deleted
        else:
            i += 1             # i moved to the next
        n += 1
    return lst

def solution_pepr_reversed(lst):
    i = len(lst) - 1           # indexing the processed item backwards
    while i > 0:
        if lst[i] % 2 != 0:    # simulation of the choice
            del lst[i]         # i unchanged if item deleted
        i -= 1                 # i moved to the previous
    return lst

def solution_steveha(lst):
    def should_keep(x):
        return x % 2 == 0
    return filter(should_keep, lst)

orig_lst = range(30)
print 'range() generated list of the length', len(orig_lst)
print orig_lst[:20] + ['...']   # to have some fun :)

lst = orig_lst[:]  # copy of the list
print solution_ninjagecko1(lst)

lst = orig_lst[:]  # copy of the list
print solution_jdi(lst)

lst = orig_lst[:]  # copy of the list
print solution_Patrik(lst)

lst = orig_lst[:]  # copy of the list
print solution_pepr(lst)

orig_lst = [random.randint(1, 1000000) for n in xrange(100000)]
print '\nrandom list of the length', len(orig_lst)
print orig_lst[:20] + ['...']   # to have some fun :)

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_ninjagecko1(lst)',
                  'from __main__ import solution_ninjagecko1, lst',
                  number=1)
print 'solution_ninjagecko1: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_jdi(lst)',
                  'from __main__ import solution_jdi, lst',
                  number=1)
print 'solution_jdi: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_Patrik(lst)',
                  'from __main__ import solution_Patrik, lst',
                  number=1)
print 'solution_Patrik: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_Patrik2(lst)',
                  'from __main__ import solution_Patrik2, lst',
                  number=1)
print 'solution_Patrik2: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_pepr_reversed(lst)',
                  'from __main__ import solution_pepr_reversed, lst',
                  number=1)
print 'solution_pepr_reversed: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_pepr(lst)',
                  'from __main__ import solution_pepr, lst',
                  number=1)
print 'solution_pepr: ', t

lst = orig_lst[:]  # copy of the list
t = timeit.timeit('solution_steveha(lst)',
                  'from __main__ import solution_steveha, lst',
                  number=1)
print 'solution_steveha: ', t

It prints on my console:

c:\tmp\_Python\Patrick\so10305762>python a.py
range() generated list of the length 30
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, '...']
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

random list of the length 100000
[915411, 954538, 794388, 847204, 846603, 454132, 866165, 640004, 930488, 609138,
 333405, 986073, 318301, 728151, 996047, 117633, 455353, 581737, 55350, 485030,
'...']
solution_ninjagecko1:  2.41921752625
solution_jdi:  2.45477176569
solution_Patrik:  0.0468565138865
solution_Patrik2:  0.024270403082
solution_pepr_reversed:  2.43338888043
solution_pepr:  9.11879694207

So, I tried with longer list. Using only twice as long one makes a big difference (on my old computer). Patrik's dirty solution behaves very nicely. It is about 200 times faster than the reversed solutions:

random list of the length 200000
[384592, 170167, 598270, 832363, 123557, 81804, 319315, 445945, 178732, 726600,
516835, 392267, 552608, 40807, 349215, 208111, 880032, 520614, 384119, 350090, 
'...']
solution_ninjagecko1:  17.362140719
solution_jdi:  17.86837545
solution_Patrik:  0.0957998851809
solution_Patrik2:  0.0500024444448
solution_pepr_reversed:  17.5078452708
solution_pepr:  52.175648581

[Added after ninjagecko's comments]

The corrected Patrik2 solution is about twice faster than 2-stage Patrick solution.

To simulate not so frequent deletion of the elements, the test like if v % 2 != 0: was changed to if v % 100 == 0:. Then about 1 % of items should be deleted. It is obvious that it takes less time. For 500,000 random integers in the list, the results are the following:

random list of the length 500000
[403512, 138135, 552313, 427971, 42358, 500926, 686944, 304889, 916659, 112636,
791585, 461948, 82622, 522768, 485408, 774048, 447505, 830220, 791421, 580706, 
'...']
solution_ninjagecko1:  6.79284210703
solution_jdi:  6.84066913532
solution_Patrik:  0.241242951269
solution_Patrik2:  0.162481823807
solution_pepr_reversed:  6.92106007886
solution_pepr:  7.12900522273

Patrick's solution is stil about 30 times faster.

[Added 2012/04/25]

Another solution that works in-place, that loops forward, and that is as fast as Patrick's solution. It does not move all the tail when the element is deleted. Instead, it moves the wanted elements to their final position and then cuts the unused tail of the list.

def solution_pepr2(lst):
    i = 0
    for v in lst:
        lst[i] = v              # moving the element (sometimes unneccessary)
        if v % 100 != 0:        # simulation of the choice
            i += 1              # here will be the next one
    lst[i:] = []                # cutting the tail of the length of the skipped
    return lst

# The following one only adds the enumerate to simulate the situation when
# it is needed -- i.e. slightly slower but with the same complexity.        
def solution_pepr2enum(lst):
    i = 0
    for n, v in enumerate(lst):
        lst[i] = v              # moving the element (sometimes unneccessary)
        if v % 100 != 0:        # simulation of the choice
            i += 1              # here will be the next one
    lst[i:] = []                # cutting the tail of the length of the skipped
    return lst

Compared with the above solutions for v % 100 != 0:

random list of the length 500000
[533094, 600755, 58260, 295962, 347612, 851487, 523927, 665648, 537403, 238660,
781030, 940052, 878919, 565870, 717745, 408465, 410781, 560173, 51010, 730322, 
'...']
solution_ninjagecko1:  1.38956896051
solution_jdi:  1.42314502685
solution_Patrik:  0.135545530079
solution_Patrik2:  0.0926935780151
solution_pepr_reversed:  1.43573239178
solution_steveha:  0.122824246805
solution_pepr2:  0.0938177241656
solution_pepr2enum:  0.11096263294
pepr
  • 20,112
  • 15
  • 76
  • 139
  • 1
    Interesting. Unfortunately you didn't check the `reversed(range(len(yourList)))` solution (though if you did, it would be about the same as the first solution). However I don't believe the benchmarks are reasonable; in those benchmarks, you are removing **half** the elements. In that case, I would just do `[x for i,x in enumerate(lst) if i%2!=0]` and ignore the in-place requirement; this achieves a result twice as fast the fastest solution you benchmarked. Furthermore the solution you provide is not the "dirty" solution because it is not in-place and uses `[...]`. – ninjagecko Apr 24 '12 at 23:16
  • It is true that if you delete only 20% of the elements of the list, the Patrick method is about 4x as fast, and that if you delete only 10% of the elements of the list, the Patrick method is about 2x as fast, according to your tests, though unfortunately I do not have the time to double-check them. – ninjagecko Apr 24 '12 at 23:22
  • @ninjagecko: I was also surprised. I tried to delete about 1 % of the random elements (see the edited text). Patrick's method is still about 30 times faster for 500,000 elements. The question is: *Why should we insist on an in-place solution?* – pepr Apr 24 '12 at 23:50
  • Bug in Patrick2 corrected -- about twice faster. – pepr Apr 25 '12 at 00:16
  • 1
    This is kind of a silly test. Its weighted different depending on how many deletions actually happen. When I replace inner loop logic with `pass`, and replaced case-tests with `if False`, etc... to prevent any deletions from happening, @ninjagecko comes out ahead, then mine, then the others. – jdi Apr 25 '12 at 01:07
  • I just added in `solution_steveha()` which is a simple use of `filter()`. On my computer this is the second-fastest, about half the speed of the fastest (Patrik2). I didn't add in a benchmark speed because all the times were different on my computer. – steveha Apr 25 '12 at 01:45
  • @steveha: Again though, try this test when no elements are being deleted. Patricks crumbles when it has to deal with a larger resulting data set. – jdi Apr 25 '12 at 01:51
  • @jdi, the simple and clean solution using `filter()` is faster than `solution_Patrik()`. It is slower than `solution_Patrik2()`, but that is a list comprehension with its own expression to do the filtering; the speed win there is due to the lack of a function call and thus is unrealistic. It's not really practical to write a list comprehension that is calling `raw_input()` as part of a chain of logic; it would sure be ugly if you tried it. I would prefer a simple and clean solution even if it were slower, but I have shown that it really isn't. – steveha Apr 25 '12 at 02:14
  • @steveha: The filter() solution is basically the same as the list comprehension. It is also not in-place solution (the same as the solution_Patrick2). It is more general, so it pays for the function call. I agree that combination with `raw_input` makes it funny. However, the goal to filter out the unwanted elements is not that boring ;) – pepr Apr 25 '12 at 07:09
  • I have added new, fast, in-place, forward moving solution as solution_pepr2. Please, have a look ;) – pepr Apr 25 '12 at 07:10
  • @pepr, the overhead of a function call makes code run more slowly. However, the complexity of the filtering function (it invokes `raw_input()`!) means that the filtering will almost certainly be packaged up as a function. Therefore, I do not think it is valuable to benchmark solutions that do not call a function. Personally I prefer the list comprehension solution, but the `filter()` solution is the fastest one in your benchmarks that calls a function to do the filtering. If you make `solution_pepr2()` call a function, it will become slower; I tested it and confirmed it. – steveha Apr 25 '12 at 08:20
  • Of course, it is possible to avoid the function in this case; just for fun, I wrote it and added it to my answer. Yuck! The function is better. – steveha Apr 25 '12 at 08:29
  • @steveha: I like all the solutions as one can learn from them. They force to think about the problem. I tried to make them comparable for measuring the time. I agree that it is usually not the most important thing. But it can be, sometimes. Actually the pepr2 can be slightly modified to resemble syntactically the `filter()`. The main difference is that it works in-place -- which may be both plus and minus depending on the situation. – pepr Apr 25 '12 at 11:20
1

Currently I have the very dirty solutions of a) setting items in the list that I want to remove to False and removing them with a filter or list comprehension or b) generating an entirely new list while going through the loop, which seems to be needlessly adding variables to the namespace and taking up memory.

Actually, it is not that dirty solution. How long the list typically is? Even creating the new list should not be that much memory consumig as the list contains only references.

You can also loop in the while loop and enumerate for yourself, performing del lst[n] if the user decides (possibly counting separately the positions in the original).

pepr
  • 20,112
  • 15
  • 76
  • 139
0

The best way to handle this, the most "Pythonic" way, is in fact to loop over your list and make a new list that contains just the folders you want. Here is how I would do it:

def want_folder(fname):
    if get_size(folder) >= byte_threshold:
        return True
    ans = raw_input(('The folder {0}/ is less than {1}MB.' + \
                ' Would you like to exclude it from' + \
                ' compression? ').format(folder, megabyte_threshold))
    return 'y' not in ans.strip().lower()

to_run_folders = [fname for fname in to_run_folders if want_folder(fname)]

If your list is truly large, then maybe you need to worry about the performance of this solution and use dirty tricks. But if your list is that large, it might be sort of crazy to have a human answering yes/no question about all the files that might show up.

Is performance an actual problem or just a sort of nagging worry? Because I'm pretty sure the above code is fast enough for practical use, and it is easier to understand and modify than tricky code.

EDIT: @jdi suggested, in the comments, using itertools.ifilter() or filter()

I tested, and this should actually be faster than what I showed above:

to_run_folders = filter(want_folder, to_run_folders)

I just copied @pepr's benchmarking code, and tested the solution using filter() as shown here. It was second fastest overall, with only Patrik2 being faster. Patrik2 was twice as fast, but again, any data set small enough that it is practical to have a human answering yes/no questions is probably small enough that a factor of two won't matter too much.

EDIT: Just for fun, I went ahead and wrote a version that is a pure list comprehension. It has a single expression to evaluate, no Python function call.

to_run_folders = [fname for fname in to_run_folders
        if get_size(fname) >= mb_threshold or
                'y' not in raw_input(('The folder {0}/ is less than {1}MB.' +
                ' Would you like to exclude it from compression? '
                ).format(fname, mb_threshold)).strip().lower()

]

Yuck! I much prefer making a function.

steveha
  • 74,789
  • 21
  • 92
  • 117
  • This is basically `itertools.ifilter(want_folder, to_run_folders)` which is faster and more efficient. – jdi Apr 25 '12 at 01:12
  • `itertools.ifilter()` would indeed be a good replacement for the generator expression I suggested. But what if you want the list? Is `list(itertools.ifilter(want_folder, to_run_folders))` faster than the list comprehension? I just checked with `timeit` and it was about 25% faster with a simple test involving filtering a long list of `int` values. – steveha Apr 25 '12 at 01:25
  • If you want the results directly as a list then use `filter`. It just depends how the resulting value will be used later. – jdi Apr 25 '12 at 01:26
  • Are these faster than the list comprehension because the listcomp is repeatedly rebinding the variable name? Both the listcomp and `filter()` are builtins to Python, and thus both written in C, so that's the major thing that can explain the speed difference (at least that I can think of). – steveha Apr 25 '12 at 01:31
  • I think generally they should be somewhat close if they are both calling python functions. But if the filter is calling a builtin, then its faster. In this specific case, they both have overhead of calling a python function. – jdi Apr 25 '12 at 01:40
  • In this case, the filter is not calling a builtin, but it is still faster. I think the reason is that the list comprehension has to keep rebinding the variable and then looking up the variable, while `filter()` just evaluates the function object once and then applies it directly to values from the input sequence without binding any names at all. – steveha Apr 25 '12 at 02:08