List comprehensions are fast.
deleted = [tokens[i] for i in sorted(indices_to_remove)]
tokens = [e for i, e in enumerate(tokens) if i not in indices_to_remove]
But first, given some ambiguity in the question, and for the sake of performance, some assumptions:
- indices_to_remove is a set (O(1) membership test).
- The OP wants deleted to be in the same order as in the initial tokens list.
- (Obviously) all indices are valid (0 <= i < len(tokens) for all i). If not, well, you'll get an IndexError.
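To make these assumptions concrete, here is a tiny, made-up example of the recommended solution (the values are purely illustrative):
tokens = ['a', 'b', 'c', 'd', 'e']
indices_to_remove = {1, 3}

deleted = [tokens[i] for i in sorted(indices_to_remove)]
tokens = [e for i, e in enumerate(tokens) if i not in indices_to_remove]

print(deleted)  # ['b', 'd']  -- same order as in the initial list
print(tokens)   # ['a', 'c', 'e']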
Speed test
import random
from math import isqrt
def gen(n):
    tokens = [f'x-{i}' for i in range(n)]
    indices_to_remove = set(random.sample(range(n), isqrt(n)))
    return tokens, indices_to_remove
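As a quick sanity check of gen (the sampled indices are random, so the exact set differs between runs):
tokens, indices_to_remove = gen(16)
print(len(tokens), tokens[:3])  # 16 ['x-0', 'x-1', 'x-2']
print(len(indices_to_remove))   # 4, i.e. isqrt(16)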
# our initial solution
def f0(tokens, indices_to_remove):
    deleted = [e for i, e in enumerate(tokens) if i in indices_to_remove]
    tokens = [e for i, e in enumerate(tokens) if i not in indices_to_remove]
    return deleted, tokens
# improved version, based on a comment by @KellyBundy
def f1(tokens, indices_to_remove):
    deleted = [tokens[i] for i in sorted(indices_to_remove)]
    tokens = [e for i, e in enumerate(tokens) if i not in indices_to_remove]
    return deleted, tokens
# @thefourtheye version
def f4(tokens, indices_to_remove):
    removed, retained = [], []
    for idx, token in enumerate(tokens):
        selected_list = removed if idx in indices_to_remove else retained
        selected_list.append(token)
    return removed, retained
# modified Kelly_1 to comply with our assumptions
# the original is certainly even faster, but modifies tokens
# in place, so it is harder to benchmark (a sketch of such an
# in-place variant is shown below)
def Kelly_1(tokens, indices_to_remove):
    tokens = tokens.copy()
    indices_to_remove = sorted(indices_to_remove)
    return [*map(tokens.pop, reversed(indices_to_remove))][::-1], tokens
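For reference, here is a minimal sketch of such an in-place variant. It is my own reconstruction under the assumptions above, not necessarily the exact code from the comment, and it is kept out of the benchmark because it consumes its tokens argument:
def kelly_inplace(tokens, indices_to_remove):
    # pop from the highest index down so the remaining indices stay valid,
    # then reverse so deleted keeps the original token order
    deleted = [tokens.pop(i) for i in sorted(indices_to_remove, reverse=True)][::-1]
    return deleted, tokens  # note: tokens has been mutated in place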
Test:
tokens, indices_to_remove = gen(1_000_000)
%timeit f0(tokens, indices_to_remove)
134 ms ± 603 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit f4(tokens, indices_to_remove)
111 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit f1(tokens, indices_to_remove)
76.9 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here is a benchmark to see the runtime differences over scale (the length of tokens; note that the number of tokens to delete is always ~sqrt(n)):
import perfplot
perfplot.show(
    setup=gen,
    kernels=[f0, f1, f4, Kelly_1],
    n_range=[2**k for k in range(3, 20)],
    relative_to=1,
    xlabel='n elements',
    equality_check=lambda x, y: x == y,
)
We see something very interesting about Kelly_1: even though it does some extra work (copying the initial list to protect the argument, and sorting the set into a list), it absolutely obliterates all the other solutions up to about 100K elements. After that, its time ramps up significantly. I need to understand why.
Tested with Python 3.9.0.