3

Through this code I get very long performance (over 24h) working on large lists (~150M list elements inside, with 4 string elements each). I need to delete around 66M tuples from it:

def erase_random_elements(elements_list, iterable_random_numbers):
    for i in sorted(iterable_random_numbers, reverse = True):
        elements_list.pop(i)
    return elements_list

It seems I've got enough RAM for it, so I don't need to chunk the list. How can I do it faster?

Pawel Osipowski
  • 514
  • 1
  • 4
  • 8

3 Answers3

3

Don't use list.pop(i) as it takes O(N) time for each delete operation (the Delete Item on the linked page). Since you have enough memory you can create a new list:

def erase_random_elements(elements_list, iterable_random_numbers):
    s = set(iterable_random_numbers)
    elements_list[:] = [x for i, x in enumerate(elements_list) if i not in s]
    #return elements_list -> not required actually because we're updating the list in-place

Note that here I used elements_list[:] because in your original function you're updating the actual list instead of creating a new one and that means all the references to that list will be intact. But if that is not required then you can simple remove the [:], un-comment the return line and assign the returned list from the function to a variable.

Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
3

Ashwini has provided a sound solution for using lists but I'd also like to suggest using numpy instead of the builtin Python lists.

The example below using array indexing (using another array for indexing) as well as ~ inverting the array indexing.

import numpy as np

N = 100000
a = 10*np.arange(N) # Generate your large array.
b = np.arange(0,N,2) # Generate some indexes to remove from the array.

a2 = a[~b] # Use ~b to remove all elements that correspond to the indexes within b

Comparing the two solutions (Ashwini's list comprehension and the numpy method above) using %timeit in IPython shows that using numpy is significantly (~100x) faster:

In [28]: %timeit a2 = a[~b]
1000 loops, best of 3: 1.37 ms per loop
In [29]: %timeit a2 = [x for i, x in enumerate(a) if i not in b]
10 loops, best of 3: 170 ms per loop
Community
  • 1
  • 1
Ffisegydd
  • 51,807
  • 15
  • 147
  • 125
2

Since the values are not falsy (tuples), then you can do

def erase_random_elements(elements_list, iterable_random_numbers):
    for i in iterable_random_numbers:
        elements_list[i] = None
    elements_list[:] = filter(None, elements_list)