What is the most efficient way to remove a group of indices from a list of numbers in Python 2.7?

Question

So I was wondering how I can, using Python 2.7, most efficiently take a list of values used to represent indices like this: (but with a length of up to 250,000+)

indices = [2, 4, 5]

and remove that list of indices from a larger list like this: (3,000,000+ items)

numbers = [2, 6, 12, 20, 24, 40, 42, 51]

to get a result like this:

[2, 6, 20, 42, 51]

I'm looking for an efficient solution more than anything else. I know there are many ways to do this, however that's not my problem. Efficiency is. Also, this operation will have to be done many times and the lists will both get exponentially smaller. I do not have an equation to represent how much smaller they will get over time.

edit:

Numbers must remain sorted in a list the entire time or return to being sorted after the indices have been removed. The list called indices can either be sorted or not sorted. It doesn't even have to be in a list.

Do you want to alter the list in place or create a new list or doesn't matter? — FogleBird, Nov 27 '12 at 00:34
See this question : http://stackoverflow.com/questions/6486450/python-compute-list-difference — voscausa, Nov 27 '12 at 00:42
@enginefree Each item in the list called indices represents an index in the list below it called numbers. So I'm trying to remove numbers[2] numbers[4] and numbers[5] from numbers. — Steven Hicken, Nov 27 '12 at 00:44

score 7 · Answer 1 · edited Apr 09 '23 at 18:00

You may want to consider using the numpy library for efficiency (which if you're dealing with lists of integers may not be a bad idea anyway):

>>> import numpy as np
>>> a = np.array([2, 6, 12, 20, 24, 40, 42, 51])
>>> np.delete(a, [2,4,5])
array([ 2,  6, 20, 42, 51])

Notes on np.delete: http://docs.scipy.org/doc/numpy/reference/generated/numpy.delete.html

It might also be worth looking at keeping the main array as is, but maintaining a masked array (I haven't done any speed tests on that either, though).
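The idea of leaving the main sequence untouched and tracking removals in a separate mask can also be sketched with the standard library alone. This is not numpy's masked array, only an analogous illustration; the helper names `make_keep_mask` and `apply_mask` are hypothetical, and no timings are claimed for it.

```python
from itertools import compress

def make_keep_mask(length, indices):
    # One byte per element: 1 means "keep", 0 means "removed".
    mask = bytearray(b"\x01") * length
    for i in indices:
        mask[i] = 0
    return mask

def apply_mask(numbers, mask):
    # compress() yields only the items whose mask entry is truthy.
    return list(compress(numbers, mask))

numbers = [2, 6, 12, 20, 24, 40, 42, 51]
mask = make_keep_mask(len(numbers), [2, 4, 5])
print(apply_mask(numbers, mask))  # [2, 6, 20, 42, 51]
```

The original list is never modified, so further removals only need to clear more entries in the mask.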

edited Apr 09 '23 at 18:00

Glorfindel


answered Nov 27 '12 at 00:41

Jon Clements


I tested this using my test suite in my answer and it's not significantly faster than a list comprehension. (0.53 seconds versus 0.59 seconds) – FogleBird Nov 27 '12 at 00:51
Last time I tried installing numpy, I couldn't find a 64-bit build for Mac OS X Lion. Only 32-bit. And I would really prefer to use 64-bit. I could be wrong though. They may have a 64-bit build that I haven't seen. – Steven Hicken Nov 27 '12 at 00:52
@StevenHicken might also be worth looking at masked arrays – Jon Clements Nov 27 '12 at 00:57
I just took a look at masked arrays. They may come in use but I would have to redesign the algorithm I'm working on. – Steven Hicken Nov 27 '12 at 02:16

score 6 · Accepted Answer · answered Nov 27 '12 at 01:41

I have a suspicion that taking whole slices between the indices might be faster than the list comprehension:

def remove_indices(numbers, indices):
    result = []
    i=0
    for j in sorted(indices):
        result += numbers[i:j]
        i = j+1
    result += numbers[i:]
    return result
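For reference, here is the same function run on the sample data from the question, together with one way to update the existing list object through slice assignment. The slice assignment is an addition here, not part of the answer, and the sketch assumes non-negative, in-range indices.

```python
def remove_indices(numbers, indices):
    # Same slicing approach as above: copy the runs between removed indices.
    result = []
    i = 0
    for j in sorted(indices):
        result += numbers[i:j]
        i = j + 1
    result += numbers[i:]
    return result

numbers = [2, 6, 12, 20, 24, 40, 42, 51]
alias = numbers  # a second reference to the same list object

# Slice assignment replaces the contents without rebinding the name,
# so every reference to the list sees the shortened version.
numbers[:] = remove_indices(numbers, [5, 2, 4])
print(numbers)           # [2, 6, 20, 42, 51]
print(alias is numbers)  # True
```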

answered Nov 27 '12 at 01:41

John La Rooy


Good point actually. Also, is the sorted() method necessary in the for loop? indices is already sorted. I haven't used python in a while so maybe I'm not getting something. – Steven Hicken Nov 27 '12 at 01:56
Also, I'm about to test it. – Steven Hicken Nov 27 '12 at 01:59
Much faster... 0.15 seconds. – FogleBird Nov 27 '12 at 02:02
I sorta considered this too but was too lazy to try it. Well done! – FogleBird Nov 27 '12 at 02:03
@StevenHicken, You don't need the `sorted()` if indices is always already sorted. It won't hurt much to leave it in though because timsort is linear over a presorted list. – John La Rooy Nov 27 '12 at 02:05
This is a good amount faster than FogleBird's solution. I couldn't seem to get his improved function to work but his original took 1.05 seconds and yours took 0.75 seconds on my laptop. – Steven Hicken Nov 27 '12 at 02:05
@StevenHicken: My improved one assumed that indices was already a set. – FogleBird Nov 27 '12 at 02:07
In any case, gnibbler's is still faster. – FogleBird Nov 27 '12 at 02:08
I added a graph with benchmarks of different options, this is by far the best. – bradley.ayers Nov 27 '12 at 02:24
@FogleBird That'd make a lot more sense. No wonder I couldn't get it to work. – Steven Hicken Nov 27 '12 at 02:51

bradley.ayers · Answer 3 · 2012-11-27T03:41:56.640

4

Another option:

>>> numbers = [2, 6, 12, 20, 24, 40, 42, 51]
>>> indices = [2, 4, 5]
>>> offset = 0
>>> for i in indices:
...     del numbers[i - offset]
...     offset += 1
...
>>> numbers
[2, 6, 20, 42, 51]
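A variant of the same idea, as a sketch: deleting from the highest index down means earlier deletions never shift the positions that are still to be removed, so no offset is needed, and the input does not have to be sorted beforehand. Each `del` still shifts the tail of the list, so this is not expected to be fast on large inputs.

```python
numbers = [2, 6, 12, 20, 24, 40, 42, 51]
indices = [4, 2, 5]

# set() drops duplicate indices; descending order keeps the remaining
# positions valid while deleting.
for i in sorted(set(indices), reverse=True):
    del numbers[i]

print(numbers)  # [2, 6, 20, 42, 51]
```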

Edit:

So after being hopelessly wrong on this answer, I benchmarked each of the different approaches:

[Figure: benchmark graph of the different approaches]

Horizontal axis is number of items, vertical is time in seconds.

The fastest option is using slicing to build a new list (from @gnibbler):

def using_slices(numbers, indices):
    # indices must already be sorted in ascending order
    result = []
    i = 0
    for j in indices:
        result += numbers[i:j]
        i = j + 1
    result += numbers[i:]
    return result

Surprisingly, it and "sets" (@Eric) both beat numpy.delete (@Jon Clements).

Here's the script I used, perhaps I've missed something.

edited Nov 27 '12 at 03:41

answered Nov 27 '12 at 00:36

bradley.ayers


Consider that each `del` operation is resizing the list. – FogleBird Nov 27 '12 at 00:42
@JonClements, interestingly masked arrays seem to perform poorly. – bradley.ayers Nov 27 '12 at 02:58

FogleBird · Answer 4 · 2012-11-27T00:58:38.133

Here's my first approach.

def remove_indices(numbers, indices):
    indices = set(indices)
    return [x for i, x in enumerate(numbers) if i not in indices]

Here's a test module to test it under the conditions you specified. (3 million elements with 250k to remove)

import random

def create_test_set():
    numbers = range(3000000)
    indices = random.sample(range(3000000), 250000)
    return numbers, indices

def remove_indices(numbers, indices):
    indices = set(indices)
    return [x for i, x in enumerate(numbers) if i not in indices]

if __name__ == '__main__':
    import time
    numbers, indices = create_test_set()
    a = time.time()
    numbers = remove_indices(numbers, indices)
    b = time.time()
    print b - a, len(numbers)

It takes around 0.6 seconds on my laptop. You might consider making the indices a set beforehand if you'll be using it multiple times.

(FWIW bradley.ayers's solution took longer than I was willing to wait.)
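For comparing approaches side by side, a small harness along these lines may help. It is only a sketch: the names `with_set`, `with_slices` and `compare` are made up for illustration, the default sizes are deliberately smaller than those in the question so that it finishes quickly, and the numbers it prints will vary by machine.

```python
import random
import timeit

def with_set(numbers, indices):
    # Set lookup plus list comprehension.
    indices = set(indices)
    return [x for i, x in enumerate(numbers) if i not in indices]

def with_slices(numbers, indices):
    # Copy the runs between removed indices.
    result = []
    i = 0
    for j in sorted(indices):
        result += numbers[i:j]
        i = j + 1
    result += numbers[i:]
    return result

def compare(n=300000, k=25000, repeat=3):
    # Raise n and k to approximate the conditions in the question.
    numbers = list(range(n))
    indices = random.sample(range(n), k)
    # Both approaches must agree before their timings mean anything.
    assert with_set(numbers, indices) == with_slices(numbers, indices)
    timings = {}
    for func in (with_set, with_slices):
        timer = timeit.Timer(lambda: func(numbers, indices))
        # Best of several single runs, in seconds.
        timings[func.__name__] = min(timer.repeat(repeat=repeat, number=1))
    return timings

for name, seconds in sorted(compare().items()):
    print("%-12s %.4f s" % (name, seconds))
```

Taking the minimum over a few repeats reduces noise from other processes; it says nothing about memory use.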

Edit: This is slightly faster (0.55 seconds), provided indices is already a set:

def remove_indices(numbers, indices):
    return [numbers[i] for i in xrange(len(numbers)) if i not in indices]

Thanks for testing all of those. As of right now, this seems like the best route to go. — Steven Hicken, Nov 27 '12 at 01:04
I'm going to wait for a little while to see if another solution shows up. — Steven Hicken, Nov 27 '12 at 01:05

score 2 · Answer 5 · answered Nov 27 '12 at 00:41

Not that efficient, but a different approach

indices = set([2, 4, 5])

result = [x for i,x in enumerate(numbers) if i not in indices]

answered Nov 27 '12 at 00:41

Eric


This is how I would do it. It has the added benefit of not requiring an external dependency. – Thane Brimhall Nov 27 '12 at 00:53

score 1 · Answer 6 · answered Nov 21 '16 at 14:21

Another different approach to achieve that:

>>> numbers = [2, 6, 12, 20, 24, 40, 42, 51]
>>> indices = [2, 4, 5]
>>> [item for item in numbers if numbers.index(item) not in indices]
[2, 6, 20, 42, 51]
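One caveat worth illustrating: `list.index` returns the position of the first matching value, so this approach only behaves as intended when every value in the list is distinct. Each lookup also rescans the list from the start, which makes it quadratic in the list length. A small sketch with a duplicated value (sample data made up for illustration), compared against an `enumerate`-based filter:

```python
numbers = [2, 6, 6, 20]
indices = [1]  # intent: drop only the element at position 1

# index(6) is always 1, so both 6s are treated as "at index 1".
by_value = [item for item in numbers if numbers.index(item) not in indices]

# enumerate pairs each item with its true position.
to_remove = set(indices)
by_position = [x for i, x in enumerate(numbers) if i not in to_remove]

print(by_value)     # [2, 20]
print(by_position)  # [2, 6, 20]
```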

answered Nov 21 '16 at 14:21

ettanany

