8

Say I have a list in Python 3.X. I use a list comprehension to return a subset of that list. Is there an easy/Pythonic way to "pop" that subset off of the original list (so the elements in that subset are no longer in the original list after being returned)? Thanks!

Example (I need help in defining my_func):

a = [1, 2, 2, 3, 4, 5]
a, b = my_func(a, **kwargs)

I'd then want:

a = [1, 2, 2]
b = [3, 4, 5]

Note: I do not want to pop the values off one at a time, but rather the whole subset at once. "Pop" may not be the best terminology, but I don't know what is.

The best way I can think of doing it is:

temp = [idx for idx, val in enumerate(a) if val > 2]
b = [a[i] for i in temp]
a = [val for idx, val in enumerate(a) if idx not in temp]

Obviously, I would prefer something more elegant and efficient. I want to avoid going over the list twice.

EDIT: I think this question is unique because I am not simply concerned with splitting the list; I want to modify the original list. Splitting and assigning one of the splits back to the original variable is a possible solution, but I was hoping there might be a way to do it without explicitly reassigning that variable (similar to something like b.append(a.pop())).
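
For reference, slice assignment is one way to mutate the existing list object rather than rebinding the name; a minimal sketch (it still builds two new lists, but a itself is modified in place):

# a[:] = ... replaces the contents of the existing list object,
# so every other reference to that same list sees the change.
a = [1, 2, 2, 3, 4, 5]
b = [x for x in a if x > 2]       # b == [3, 4, 5]
a[:] = [x for x in a if x <= 2]   # a is still the same object, now [1, 2, 2]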

TaylerJones
  • 242
  • 2
  • 15
  • 1
    Do add sample code, input and output – Bhargav Rao Jan 12 '15 at 17:48
  • 5
    If the data in the `list` is actually appropriate for a `set` you can use its operators like `intersection`, `difference`, etc. (a short sketch follows this comment thread). – Brian Cain Jan 12 '15 at 17:49
  • 1
    What do you mean by pop? Do you want to grab that subset again, or do you want to remove it? – Mazdak Jan 12 '15 at 17:50
  • 1
    It sounds like he wants two subsets - the filtered one in the new list and what wasn't filtered in the old list. – user3467349 Jan 12 '15 at 17:51
  • @user3467349 is correct. – TaylerJones Jan 12 '15 at 17:54
  • @BrianCain , I was a little sloppy with my terminology. It is not a set in the python sense (e.g. a != set(a)). I have added some inputs/outputs to clarify. – TaylerJones Jan 12 '15 at 17:54
  • 1
    Related: http://stackoverflow.com/questions/949098/python-split-a-list-based-on-a-condition – zch Jan 12 '15 at 17:58
  • @zch It seems like a duplicate. – user3467349 Jan 12 '15 at 18:41
  • @user3467349 I specified I do not want to go over the list twice, which the other answer does not do. IMO the answer here that is similar to the picked answer in the link is better in that it is more in line with DRY because of the lambda. – TaylerJones Jan 12 '15 at 18:43
  • Yes, the other question is also asking about how to do it without iterating twice - I agree that the accepted answer on there isn't good. But for example this one is: http://stackoverflow.com/a/12135169/3467349 – user3467349 Jan 12 '15 at 18:50
  • wow, it is an absolute shame that that answer is buried. Very elegant, efficient, and readable. Also, fwiw, my question also has to do with modifying the original list, not just splitting it in two. While splitting and reassigning the appropriate split to the original list is a possible solution, I was hoping there was a way to do something like b.append(a.pop()) where I do not need to explicitly modify the original list. – TaylerJones Jan 12 '15 at 19:35
  • There is - see the updated first part of my answer - but I like solution in gnibbler's answer better tbh. – user3467349 Jan 12 '15 at 20:24
  • gnibbler's answer uses the old hacky way of implementing the ternary condition `good.append(x) if x in goodvals else bad.append(x)`. And from there, you might as well just do an explicit multi-line `if x > 2: b.append(x); else: c.append(x)`. – zehnpaard Jan 12 '15 at 21:15
  • @TaylerJones: btw, is there a reason against creating two new lists and assigning one to the original variable? Cooking up something that mutates the original list is possible (see Loupi's answer), but because you're removing items that aren't always at the end of the list, it causes various elements in the list to be shifted every time you remove something. Time-wise it's got bad algorithmic complexity. – zehnpaard Jan 12 '15 at 21:38
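
For contrast, a quick sketch of the set-based idea from Brian Cain's comment: it only reproduces the desired split when every value is unique, because sets discard duplicates and ordering (which is why it does not fit the [1, 2, 2, 3, 4, 5] example above).

a = {1, 2, 3, 4, 5}
b = {x for x in a if x > 2}   # {3, 4, 5}
a -= b                        # set difference, mutates a in place: {1, 2}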

4 Answers

7

Just filter out the unwanted items using list comprehension:

a = [1, 2, 2, 3, 4, 5]
condition = lambda x: x > 2
b = [x for x in a if condition(x)]      # b == [3, 4, 5] 
a = [x for x in a if not condition(x)]  # a == [1, 2, 2]

Update

If you are worried about efficiency, here is another approach that scans the list only once:

def classify(iterable, condition):
    """ Returns a list that matches the condition and a list that
    does not """
    true_list = []
    false_list = []
    for item in iterable:
        if condition(item):
            true_list.append(item)
        else:
            false_list.append(item)
    return true_list, false_list

a = [1, 2, 2, 3, 4, 5]
b, a = classify(a, lambda x: x > 2)
print(a)  # [1, 2, 2]
print(b)  # [3, 4, 5]

Update 2

I did not realize that my update looks almost the same as that of user3467349, but believe it or not, I did not cheat :-)

Hai Vu
  • 37,849
  • 11
  • 66
  • 93
  • This seems like the most readable, and is in line with the top answer in the related link in the comments of my question (except it didn't use the lambda, which is a real nice touch IMO). The only issue is I would prefer not to go over the lists twice, and this way is a violation of DRY principles. While it is probably the most readable version, this section of my code is run-time critical (but not to the point of requiring another language). – TaylerJones Jan 12 '15 at 18:20
  • You're iterating over the same list twice, for no reason. – user3467349 Jan 12 '15 at 18:32
  • @user3467349: It's not for *no* reason; it's for readability. Sometimes that is more important than execution speed. But sometimes execution speed is more important. Or sometimes memory. It depends. But readability is definitely worth something. – John Y Jan 12 '15 at 18:38
  • I will say that what I like about this answer is I could easily throw it in a function with the lambda function as an argument. The function then could be used for different object types and corresponding conditions. – TaylerJones Jan 12 '15 at 18:38
4

The one line solution (please don't actually use this):

a = [7, 4, 2, 1, 5, 11, 2]
b = [x for x in (a.pop() for i in range(len(a)))
     if (lambda x: True if x > 2 else a.insert(0, x))(x)]
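
Unrolled for readability, the same trick looks roughly like this sketch. a.insert returns None, which is falsy, so items pushed back onto the front of a are also dropped from the result:

a = [7, 4, 2, 1, 5, 11, 2]
b = []
for _ in range(len(a)):   # length captured once, before a starts shrinking
    x = a.pop()           # take from the end
    if x > 2:
        b.append(x)       # keep it
    else:
        a.insert(0, x)    # put it back at the front of a
# a == [2, 1, 2], b == [11, 5, 4, 7]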

More Reasonable Solutions

In general, if you're checking what a value is before popping it, that's a read at an index and then a pop (not to mention the funkiness of keeping track of iterating over a list of changing length), which seems to make the point of not creating two new lists moot.

So I would just add to one of two new lists, i.e.:

a = [1, 2, 2, 3, 4, 5]
b = []; c = []
for x in a:
    if x > 2: b.append(x)
    else: c.append(x)

Of course you can technically pop from a and insert back into a (you would do a.insert(0, x) instead of c.append(x)), but it's generally a bad idea to manipulate a list you are looping over.
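
One safe way to get that in-place effect, sketched here for illustration: walk the indices from the end, so a pop never shifts an element that has not been visited yet.

a = [1, 2, 2, 3, 4, 5]
b = []
for i in range(len(a) - 1, -1, -1):   # indices in reverse
    if a[i] > 2:
        b.append(a.pop(i))            # popping here only shifts items already visited
b.reverse()                           # restore the original relative order
# a == [1, 2, 2], b == [3, 4, 5]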

Another alternative is a sorted list and a bisect, i.e.:

import bisect

a = sorted(a)
ind = bisect.bisect(a, 2)
# Now both lists are here
a, b = a[:ind], a[ind:]
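
If the values repeat (as in the asker's [1, 2, 2, 3, 4, 5]), bisect.bisect_left and bisect.bisect (the right-hand variant) let you choose which side of the duplicates the cut lands on; a small sketch:

import bisect

a = sorted([1, 2, 2, 3, 4, 5])
bisect.bisect(a, 2)        # 3: cut falls after the existing 2s, a[:3] == [1, 2, 2]
bisect.bisect_left(a, 2)   # 1: cut falls before them, a[:1] == [1]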

Which method is preferable really depends on what you plan to do with the lists afterwards.

user3467349
  • 3,043
  • 4
  • 34
  • 61
  • I like this one best so far. For my purposes, I would need one more line for replacing a with c at the end. If no one comes up with some simpler syntax, I will pick this one. – TaylerJones Jan 12 '15 at 18:13
  • Don't think that second one works for my case with duplicate values (I had edited my original question which erroneously called my list a set; in reality, it has duplicate values), but it is still an interesting technique. – TaylerJones Jan 12 '15 at 18:25
  • Bisect has bisect left and right, so you can specify where it should bisect in the case of repeating values. Normally I'd use bisect if I wanted to be able to access different subsets of the list (i.e. >67%, >90%) etc. – user3467349 Jan 12 '15 at 18:30
  • 1
    Note that the first sample only works because the original input list is sorted, or rather, because it fulfils the weaker condition of 'all numbers less than or equal to 2 appear before all numbers greater than 2'. – zehnpaard Jan 12 '15 at 20:51
  • My next suggestion is that you don't really need a.pop now :) – zehnpaard Jan 12 '15 at 21:04
  • Yeah basically pop is worse than sort + slice :( – user3467349 Jan 12 '15 at 21:15
  • You can just loop over the items in `a` now, rather than repeat `a.pop()`. – zehnpaard Jan 12 '15 at 21:18
  • I was testing this code to learn from it. Maybe languages change a lot in 1 to 2 years. The last line on Python 2.7 does not output anything. So I wrapped each element in `print()` and added a `print(b)` line after the assignment statement and then I could see what it was doing. Also, `bisect` would not run until I added an `import bisect` line to the top of the code. These are all helpful examples though which is why I was tinkering with them in the first place. – TMWP Mar 24 '17 at 21:33
  • @user3467349 Even though it's not nice code, would you mind elaborating on the one-liner? I'm intrigued. Sorry for the late post – IMOPUTFIE Jun 27 '23 at 13:43
4

Keep in mind this is Python, not C, and sometimes the way it works is not necessarily intuitive. This means you have to verify your assumptions. I've done this using IPython's built-in %timeit and %%timeit magics.

a = [1, 2, 2, 3, 4, 5]
%timeit b = [x for x in a if x > 2]
1000000 loops, best of 3: 362 ns per loop
%timeit c = [x for x in a if not (x > 2)]
1000000 loops, best of 3: 371 ns per loop

So just under 800 ns for both, but we're iterating over the list twice. Surely we don't need to do that? How about a couple of the above methods? Let's start with classify, which is pretty straightforward and only walks over the list once:

%timeit b, a = classify(a, lambda x: x > 2)
1000000 loops, best of 3: 1.89 µs per loop

Wow, even though it only walks over the list once, it takes more than twice as long as the simple solution above, which walks it twice. Let's try a slight variation on the other solution proposed above:

%%timeit
b, c = [], []
for x in a:
    b.append(x) if x > 2 else c.append(x)
1000000 loops, best of 3: 1.2 µs per loop

Better, but it's still slower than our 'naive'/'inefficient' implementation. Maybe a slightly different formulation would be better:

%%timeit
b, c = [], []
for x in a:
    if x > 2:
        b.append(x)
    else:
        c.append(x)
1000000 loops, best of 3: 983 ns per loop

Hmm, that seems almost the same, but is a little faster. Still doesn't beat the naive implementation. Let's get really clever, maybe make it even faster:

%%timeit
b, c = [], []
for x in a:
    (b, c)[x > 2].append(x)
1000000 loops, best of 3: 1.28 µs per loop

So, what we saw is that despite our desire not to iterate the list twice, we don't seem to be able to improve on just doing two list comprehensions. List comprehensions run largely 'under the hood' in Python, and so for many things they are going to be faster than anything you come up with.

Now, the x > 2 comparison is cheap: no function calls, simple math. There will be a point where a single pass starts to make sense, though, once the comparison itself gets expensive. Let's create a more expensive comparison function: we'll calculate the factorial (and do it inefficiently):

def factorial(x):
    if x in (0,1):
        return 1
    else:
        return x * factorial(x-1)

%timeit b = [x for x in a if factorial(x) > 6]
100000 loops, best of 3: 3.47 µs per loop
%timeit c = [x for x in a if not factorial(x) > 6]
100000 loops, best of 3: 3.53 µs per loop

So, obviously the times went up a good bit; we are now at about 7 µs for the pair of comprehensions.

Let's try some of our other examples now:

%timeit b, c = classify(a, lambda x: factorial(x) > 6)
100000 loops, best of 3: 5.05 µs per loop

%%timeit
b, c = [], []
for x in a:
    if factorial(x) > 6:
        b.append(x)
    else:
        c.append(x)
100000 loops, best of 3: 4.01 µs per loop

%%timeit
b, c = [], []
for x in a:
    (b, c)[factorial(x) > 6].append(x)
100000 loops, best of 3: 4.59 µs per loop

The lesson from all of this: when dealing with Python & efficiency, it's usually a good idea to just try it out in a console, and very often the naive implementations are reasonably performant & the easiest to read. A quick test will tell you if trying to optimize is actually worth it; you can often make it less readable AND slower if you're not careful...

Addendum: Someone commented that we need longer lists, because we're measuring function call overhead more than performance. They have a good point, but the times show about the same relationship on my machine:

In [16]: a = range(100000)

In [17]: random.shuffle(a)

In [18]: %timeit b = [x for x in a if x > 50000]
100 loops, best of 3: 5.2 ms per loop

In [19]: %timeit c = [x for x in a if not x > 50000]
100 loops, best of 3: 5.18 ms per loop

In [20]: %timeit c = [x for x in a if x <= 50000]
100 loops, best of 3: 5.35 ms per loop

In [21]: %%timeit
   ....: b, c = [], []
   ....: for x in a:
   ....:     if x > 50000:
   ....:         b.append(x)
   ....:     else:
   ....:         c.append(x)
   ....:
100 loops, best of 3: 12.7 ms per loop

Note that if I change the comparison to x > 2 (instead of x > 50000), the second loop speeds up to about 11.4 ms. But still, that's barely any faster than the naive implementation. That said, it's probably the one I'd prefer: it's still easy to read, and it's not slower (or at least not significantly so). And code tends to get more complex over time, so when you later add another comparison, or change it to a function call, it's easier to do, and the performance will suffer less in this version than in the naive version.

But again: This isn't to say 'use list comprehensions' (though they are very often a good idea), but verify your assumptions.
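
As a rough sketch of how to run the same kind of check outside IPython, the standard-library timeit module works as well; the list size and condition below are arbitrary:

import timeit

setup = "a = list(range(100000))"
two_passes = "b = [x for x in a if x > 50000]; c = [x for x in a if x <= 50000]"
one_pass = (
    "b, c = [], []\n"
    "for x in a:\n"
    "    (b if x > 50000 else c).append(x)\n"
)
# each call returns the total time in seconds for `number` executions
print(timeit.timeit(two_passes, setup=setup, number=100))
print(timeit.timeit(one_pass, setup=setup, number=100))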

Corley Brigman
  • 11,633
  • 5
  • 33
  • 40
  • Setting the array size to a bigger one quickly shows a difference between iterating twice and once. All you're timing is the overhead of the various function calls - not iteration speed. – user3467349 Jan 12 '15 at 21:41
  • @user3467349: Are you sure? I get similar results when I try this on a randomly generated list of 1000 elements – zehnpaard Jan 12 '15 at 21:53
  • @zehnpaard the regular (b).append is 25%+ faster for me with a list of just 100 values. That isn't to say Corley isn't right in the gist of what he was saying - but from the question being asked here (on SO) I assumed the user wasn't satisfied with naively running two iterations over the same array to split lists. You might say that then they should use a c-call instead of python - but there are a lot of cases (i.e. NLP) where you already have python-like objects and using a c-call would not be so simple. Certainly the more complex the conditional, the slower the double iteration. – user3467349 Jan 12 '15 at 23:04
  • @user3467349: Very interesting, I wonder what's causing the difference. I'll try running the tests outside of IPython when I have the time. The thing about list comprehensions is that, as Corley says, they can work under the hood in a more optimized C-code fashion (sometimes), so two passes with them might be cheaper than one pass in native Python. – zehnpaard Jan 13 '15 at 01:57
  • 1
    I added some data from longer lists - from what I see here, the general consensus is still the same - a for loop that walks once, is about the same performance as a pair of list comprehensions. However, you are right - the short example probably was skewed, and a longer example would be less overhead and more of what you want. And my point wasn't really 'use list comprehensions', but actually measure it - there have been a few times I got an idea to speed something up that had the opposite effect; I only knew because I measured. – Corley Brigman Jan 13 '15 at 13:02
  • 1
    @CorleyBrigman Actually, I see what's going on - I'm running numbers in python3 and you are in python2. – user3467349 Jan 13 '15 at 21:36
0

If you want to 'pop' the values from the list a and append those values to b, I would follow this approach:

a = [1, 2, 2, 3, 4, 5]
b = []

for i in a[:]:   # iterate over a copy of a, so removing from a is safe
    if i > 2:
        b.append(i)
        a.remove(i)

# a == [1, 2, 2]
# b == [3, 4, 5]

This approach also works if list a is not in any particular order.

Loupi
  • 550
  • 6
  • 14
  • `a.remove(i)` is an O(n) operation, so the whole code is O(n^2) – zehnpaard Jan 12 '15 at 21:03
  • @zehnpaard can you please explain to me what that means as I am quite new to the programming world. – Loupi Jan 13 '15 at 07:27
  • Algorithmic complexity is a tough topic to cover in a comment, but the rough gist is that if you double the elements in `a`, other solutions take twice as long, but your code takes four times as long in the worst case. Triple the input and your code takes 9 times as long. The reason is that while it's not immediately obvious, you actually have two loops nested - one is hidden in `a.remove(i)`, where everytime you remove an item, every element after that item will need to be moved in memory. `a.remove(i)` is a costly operation and generally should be avoided if possible (a linear-time sketch follows these comments). – zehnpaard Jan 13 '15 at 18:54
  • You may want to start out with [this SO post](http://stackoverflow.com/questions/487258/plain-english-explanation-of-big-o) – zehnpaard Jan 13 '15 at 19:02
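
For completeness, a sketch of a linear-time way to get the same in-place effect as above: keep a write index into a, compact the kept items toward the front, and trim the tail once at the end, so no element is moved more than once.

a = [1, 2, 2, 3, 4, 5]
b = []
write = 0
for x in a:
    if x > 2:
        b.append(x)    # collect the "popped" items
    else:
        a[write] = x   # overwrite a position already passed; safe while iterating
        write += 1
del a[write:]          # drop the leftover tail in one step
# a == [1, 2, 2], b == [3, 4, 5]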