2

I have a list of indices:

idx = [1,4,5]

and a list of interest:

mylist = ['a','b','c','d','e','f']

I want to get all the elements out of mylist whose index is not in idx.

So the result should be:

['a','c','d']

I would also be fine with splitting mylist into ['a','c','d'] and ['b','e','f'], since I am going to use both of them anyway.

A numpy version is ok, though I am actually having just two lists for now.

SwiftsNamesake
  • 1,540
  • 2
  • 11
  • 25
Make42
  • 12,236
  • 24
  • 79
  • 155

3 Answers3

9

With numpy you can use mask arrays.

import numpy as np
x=np.array(mylist)
mask=np.full(len(mylist),True,dtype=bool)
mask[idx]=False
y=x[mask]
z=x[~mask]
print(y,z)
Dima Chubarov
  • 16,199
  • 6
  • 40
  • 76
  • 1
    Thank you very much! Your solution has even been benchmarked below and I am using your answer now when my lists are large. Since I can only choose one answer I still gave that one to MSeifert, since he covers a lot more information by now. – Make42 Aug 14 '17 at 13:38
4

If you want to keep both and you're okay with using a function, you could use:

def partition_on_index(it, indices):
    indices = set(indices)   # convert to set for fast lookups
    l1, l2 = [], []
    l_append = (l1.append, l2.append)
    for idx, element in enumerate(it):
        l_append[idx in indices](element)
    return l1, l2

There's one trick involved here and it's the l_append[idx in indices]. The idx in indices will return a boolean representing if the condition is True or False. And because booleans are a subclass of integers in Python these can be interpreted as 0 (in case of False) or 1 (if True) and are thus valid indices for the l_append tuple.

The l_append tuple thus serves as convenient alternative for an if ... else ... inside the loop and it hoists the lookup of the append methods, thus improving the speed of the function a bit. So this is equivalent to:

def partition_on_index_probably_slower(it, indices):
    indices = set(indices)   # convert to set for fast lookups
    l1, l2 = [], []
    for idx, element in enumerate(it):
        if idx in indices:
            l2.append(element)
        else:
            l1.append(element)
    return l1, l2

But let's see an example how the first function works:

>>> idx = [1,4,5]
>>> mylist = ['a','b','c','d','e','f']
>>> l1, l2 = partition_on_index(mylist, idx)
>>> l1
['a', 'c', 'd']
>>> l2
['b', 'e', 'f']

Timings:

I used the framework from this answer to measure the performance:

import random
import numpy as np

def func1(mylist, idx):
    idx = set(idx)
    in_idx, not_in_idx = [], []
    for i, e in enumerate(mylist):
        (not_in_idx, in_idx)[i in idx].append(e)
    return in_idx, not_in_idx

def partition_on_index(it, indices):
    indices = set(indices)   # convert to set for fast lookups
    l1, l2 = [], []
    l_append = (l1.append, l2.append)
    for idx, element in enumerate(it):
        l_append[idx in indices](element)
    return l1, l2

def func2(mylist, idx):
    x = np.asarray(mylist)
    mask = np.ones(len(mylist), dtype=bool)
    mask[idx] = False
    return x[mask], x[~mask]

# Timing setup
timings = {func1: [], partition_on_index: [], func2: []}
sizes = [2**i for i in range(1, 20, 2)]

# Timing
for size in sizes:
    mylist = list(range(size))
    indices = list({random.randint(0, size-1) for _ in range(size//2)})
    for func in timings:
        res = %timeit -o func(mylist, indices)
        timings[func].append(res)

enter image description here

So for small lists the function partition_on_index performs best. But if the input contains several thousand items (or more) ou might get faster results using the NumPy approach from Dmitri Chubarov. However all approaches perform asymptotically equally and the performance only differs by a factor 2-5.

MSeifert
  • 145,886
  • 38
  • 333
  • 352
  • Thanks you, that is actually pretty cool. Now I just have to my list is larger than 1100 elements and choose the method accordingly. – Make42 Aug 14 '17 at 13:32
  • @Make42 Yeah, something like that. The plot is log-log so it's actually ~2000 elements where NumPy starts winning. – MSeifert Aug 14 '17 at 13:36
2

One solution I found with list comprehension is:

idx = set(idx)
[e for i, e in enumerate(mylist) if i not in idx]

If I want both lists, as I mentioned in my question I can do this:

idx = set(idx)
in_idx, not_in_idx = [], []
for i, e in enumerate(mylist):
    (not_in_idx, in_idx)[i in idx].append(e)

and will I get:

>>> not_in_idx
['a', 'c', 'd']
>>> in_idx
['b', 'e', 'f']
MSeifert
  • 145,886
  • 38
  • 333
  • 352
Make42
  • 12,236
  • 24
  • 79
  • 155
  • 1
    Make `idx` a set first for speed. – Alex Hall Aug 12 '17 at 10:41
  • @AlexHall: Like this? – Make42 Aug 12 '17 at 15:55
  • No, that's every iteration of the comprehension. Do it before. – Alex Hall Aug 12 '17 at 15:58
  • Don't convert it to a `set` everywhere (once is enough), your current code makes a whole "set" (pun intended) of unnecessary copies. Also if you're interested in how you could "hoist" the `append` method lookup you might want to take a look at the [other answer](https://stackoverflow.com/a/45650496). – MSeifert Aug 12 '17 at 16:15
  • @MSeifert: Ah, thanks, that was not intentional. I just forgot to delete it. Not sure what you mean by "hoisting a method". Do you mean, that it is outside the for-loop? What is the advantage? – Make42 Aug 12 '17 at 16:45
  • @Make42 The only reason is performance because it doesn't need to look the method up in each iteration. :) – MSeifert Aug 12 '17 at 16:46
  • @MSeifert: So yours is faster? Benchmark? (I am not so sure how to do benchmarking for this.) – Make42 Aug 12 '17 at 17:20
  • @Make42 Yes, I added the benchmarks in my [answer](https://stackoverflow.com/a/45650496/5393381) (at the bottom). – MSeifert Aug 13 '17 at 13:47