1

I want to remove elements from an original list tokens which are at specified indices indices_to_remove and also keep the record of the deleted records but the problem is that when I have two indices to remove and I remove the record from first index, the second record cannot be removed correctly. Any idea on how to address this?

deleted_records = []
tokens = ['hi', 'what', 'is', 'there']
indices_to_remove = [2,3]
for index in indices_to_remove: 
    deleted_tokens.append(tokens[index])
    del tokens[index]
vkaul11
  • 4,098
  • 12
  • 47
  • 79

5 Answers5

2

Its generally not a good idea to modify a collection while iterating it.

So, the simplest way is to maintain two lists to gather the results like this

removed, retained = [], []
tokens = ['hi', 'what', 'is', 'there']
indices_to_remove = {2, 3} # Set for faster lookups

for idx, token in enumerate(tokens):
     selected_list = removed if idx in indices_to_remove else retained
     selected_list.append(token)

print(removed, retained)
thefourtheye
  • 233,700
  • 52
  • 457
  • 497
2

List comprehensions are fast.

deleted = [tokens[i] for i in sorted(indices_to_remove)]
tokens = [e for i, e in enumerate(tokens) if i not in indices_to_remove]

But first, given some ambiguity in the question, and for the sake of performance, some assumptions:

  1. indices_to_remove is a set (O(1) membership test).
  2. The OP wants deleted to be in the same order as in the initial tokens list.
  3. (Obviously) all indices are valid (0 <= i < len(tokens) for all i). If not, well, you'll get an IndexError.

Speed test

import random
from math import isqrt

def gen(n):
    tokens = [f'x-{i}' for i in range(n)]
    indices_to_remove = set(random.sample(range(n), isqrt(n)))
    return tokens, indices_to_remove


# our initial solution
def f0(tokens, indices_to_remove):
    deleted = [e for i, e in enumerate(tokens) if i in indices_to_remove]
    tokens = [e for i, e in enumerate(tokens) if i not in indices_to_remove]
    return deleted, tokens


# improved version, based on a comment by @KellyBundy
def f1(tokens, indices_to_remove):
    deleted = [tokens[i] for i in sorted(indices_to_remove)]
    tokens = [e for i, e in enumerate(tokens) if i not in indices_to_remove]
    return deleted, tokens


# @thefourtheye version
def f4(tokens, indices_to_remove):
    removed, retained = [], []
    for idx, token in enumerate(tokens):
        selected_list = removed if idx in indices_to_remove else retained
        selected_list.append(token)
    return removed, retained


# modified Kelly_1 to comply with our assumptions
# the original is certainly even faster, but modifies
# tokens in place, so harder to benchmark
def Kelly_1(tokens, indices_to_remove):
    tokens = tokens.copy()
    indices_to_remove = sorted(indices_to_remove)
    return [*map(tokens.pop, reversed(indices_to_remove))][::-1], tokens

Test:

tokens, indices_to_remove = gen(1_000_000)

%timeit f0(tokens, indices_to_remove)
134 ms ± 603 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit f4(tokens, indices_to_remove)
111 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit f1(tokens, indices_to_remove)
76.9 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Here is a benchmark to see the runtime differences over scale (length of tokens; note, the number of tokens to delete is always ~sqrt(n):

import perfplot

perfplot.show(
    setup=gen,
    kernels=[f0, f1, f4, Kelly_1],
    n_range=[2**k for k in range(3, 20)],
    relative_to=1,
    xlabel='n elements',
    equality_check=lambda x, y: x == y,
)

We see something very interesting about Kelly_1: even though it does some extra steps (copy the initial list, to protect the argument, sort the set to make it a sorted list), this absolutely obliterates all other solutions, up to about 100K elements. After that, the time ramps up significantly. I need to understand why.

Tested with Python 3.9.0.

Pierre D
  • 24,012
  • 7
  • 60
  • 96
  • ok, modified for no more `numpy`. Conclusions unchanged. – Pierre D Sep 21 '22 at 19:58
  • 1
    Ah dangit, of course. Sorry. I was thinking of my own solution, which does modify... – Kelly Bundy Sep 21 '22 at 20:14
  • 1
    If the indices are sorted, popping seems faster with your test case: [try](https://tio.run/##jVNbasMwEPzXKfQXqzjBIRRKIFfoBYwRIl41otaj0iYklJ49lSw5aZvQ1uAPj3ZmZ7Rrd8KdNasn589n6a2mWuCOKu2sR6rCm0dSPrwwvdWEoH0FE@iGtnJ2nL@rjxmV1lNFlUk1L1AtedM06WUdUaZXWwgcLfeg7QEiMQBWWW0RhHYDVITGJ5MHMFVuwVidHXzHCGOE9CCpbAoYy352YetRsocBEPpkFrLLmkIyCmavwQuESZcqmSPcSHWj0jX1v4SMxV/EPODem8ldXcRLqsc/UxV6@6CFK8ULZ10dDw7gA/TVDZMz1rXr9XzZXbolpdvx8DSfOO57IvHexxVBpUHhtCTRtNgPyBPqqQjjMYGjgy1GsWdrgKQb49cFWY1BEigTKJs6xs7ZttadIit7bNfl8hMSVStW8ofYMGKySuX3Lmqsc14ZrDKRzinGLgvOjdDAea6IsypGVRidZhPpuQTI3UYchvClQoQA6b8objaFMi3elZh9sPP5Ew) – Kelly Bundy Sep 21 '22 at 20:30
  • Ha! Your solution with `pop` looks like a wonderful idea. That's a native way of both altering the list and obtaining the items removed. I was thinking about how to do a single list comprehension (instead of my solution with two). But no cigar. List comprehensions are typically extremely fast. The reason @thefourtheye's solution is faster, despite using `list.append()`, is because there is only one pass. – Pierre D Sep 21 '22 at 20:44
  • Oh, ha, I just realized your solution also depends on the indices being sorted, otherwise your result has wrong order. – Kelly Bundy Sep 21 '22 at 20:47
  • 1
    Actually for `deleted`, you should just iterate over the given indices. That's faster and supports unordered indices. – Kelly Bundy Sep 21 '22 at 20:53
  • Also, since they care about the order, the input should really be a list, not a set. And the conversion to set should be included in the time. – Kelly Bundy Sep 21 '22 at 20:57
  • The tio site uses old versions of python. With python 3.10.7, the pop version is three times slower. – ekhumoro Sep 21 '22 at 21:01
  • @ekhumoro Three times slower instead of three times faster? That's very surprising. I wonder what results you'd get on your system with TIO's Python version. Anyway, my `Kelly_2` function should be much faster in this benchmark :-) – Kelly Bundy Sep 21 '22 at 21:15
  • TBH, the timings on TIO seem completely bogus. I just tested with python-3.8 and 3.9, and the pop version is still much slower (by about 50%). – ekhumoro Sep 21 '22 at 21:41
  • @ekhumoro Thanks for testing. I also tried on ideone (pop version takes twice as long, Py 3.7.3) and replit (pop version takes half as long, Py 3.8.12). Might be the most system-dependent thing I've ever seen in Python. – Kelly Bundy Sep 21 '22 at 22:26
  • @KellyBundy: thx for your comment earlier that for `deleted`, you can just iterate over the indices (duh! Of course!) The result is `f1`, which is so far the fastest of the lot. Now, I have some different assumptions than you: I think the problem ought to be: `indices_to_remove`'s order is irrelevant (so it can be a set, thanks goodness), and the order of `deleted`, as well as the modified `tokens`, are both ordered as per the original. – Pierre D Sep 21 '22 at 23:20
  • Well, my "assumptions" are really their replies to my questions (in the comments under the question). – Kelly Bundy Sep 21 '22 at 23:37
  • Actually, I stand corrected. Your version blows everything out of the water, at least for certain parameters. – Pierre D Sep 21 '22 at 23:37
  • re. assumptions: I question the precision of the answers you received to your questions. At the very least, the question should be amended to showcase what is desired if `indices_to_remove` is not monotonically increasing. – Pierre D Sep 21 '22 at 23:39
  • Thanks. Could you include `Kelly_2` as well? It was 10 times faster than `Kelly_1` in my testing (with your original benchmark, i.e., n=1e6). – Kelly Bundy Sep 21 '22 at 23:41
  • Hmm, as I see those comments, they confirmed that `indices_to_remove` *is* monotonically increasing. – Kelly Bundy Sep 21 '22 at 23:46
  • Why `Kelly_1` ramps up significantly at large n: With your test cases, it's O(n^1.5) whereas the other solutions are O(n). – Kelly Bundy Sep 21 '22 at 23:51
1

Sort the indices list in descending order. If you remove the biggest index first, it can't possibly affect the index of elements earlier in the sequence.

deleted_records = []
tokens = ['hi', 'what', 'is', 'there']
indices_to_remove = [2,3]
for index in sorted(indices_to_remove, reverse=True): 
    deleted_tokens.append(tokens[index])
    del tokens[index]
Silvio Mayolo
  • 62,821
  • 6
  • 74
  • 116
1
deleted_records = []
tokens = ['hi', 'what', 'is', 'there']
del_tokens = tokens+[]
indices_to_remove = [2,3]
for index in indices_to_remove: 
    deleted_records.append(tokens[index])
    del_tokens[index]=''
del_tokens_real = []
for t in del_tokens:
    if t!='':
       del_tokens_real.append(t)
print(del_tokens_real)
print(deleted_records)

First replace records should delete with '' and then filter other records from it to get the output. Otherwise as the list size change during the delete issues can occur.

YJR
  • 1,099
  • 3
  • 13
0

Two solutions...

This one pops the values from back to front (relies on indices_to_remove being sorted):

def Kelly_1(tokens, indices_to_remove):
    return [*map(tokens.pop, reversed(indices_to_remove))][::-1], tokens

This one gets the requested values and then copies the other values into a separate list using slices. Supports indices_to_remove being unsorted. If it's sorted, the sorted call can be removed.

def Kelly_2(tokens, indices_to_remove):
    removed = [tokens[i] for i in indices_to_remove]
    retained = []
    start = 0
    for i in sorted(indices_to_remove):
        retained += tokens[start:i]
        start = i + 1
    retained += tokens[start:]
    return removed, retained
Kelly Bundy
  • 23,480
  • 7
  • 29
  • 65
  • 1
    The performance of the second one depends heavily on the length of the indices list (and the disribution of its values). – ekhumoro Sep 21 '22 at 22:12
  • @ekhumoro Both do. I pretty much wrote them targeting Pierre's benchmark and similarly sparse removals. And it's precisely the reason I didn't say anything about speed here, even though the second one is very fast in Pierre's benchmark. Usually I would, and usually I write benchmarks myself. A lot. I considered asking the OP for such details about the inputs, but I already had to ask twice just for the sortedness, so I didn't feel like it... – Kelly Bundy Sep 21 '22 at 22:34
  • @ekhumoro How does it depend on the *distribution*? The first one does, but the second one? – Kelly Bundy Sep 22 '22 at 16:12
  • @ekhumoro `itemgetter` seems faster still, see results in the [Output](https://tio.run/##XZHdasMwDIXv/RS6KbFDyBJKyxjkSUoJbqN2gvgHW73Y02e2k2xjNzbSOdb5hP0Xfzp7fPdhWR7BGWAySAxkvAsMAT1qFiIivzwMUFWVKLag7ZSuzRa18TOuivMYNLuwa09kYjQN5DMVjGE1pjqwc3P8H6ZT0EyRZQp5ouy7uj4rJchOdMeYxDVO6gb6ruuUeN5btPqWWkpkQnF3UzHm4qIvdIVHBgKysE25iktttJe6HceNcBybXVW7/AO/skmt/noK5O9ast6l5CskbeRAXqo2@pl4JosxIYoMM2aYdcGj@hBQCDN37hf@3ATgtIYhKzeArDRQvqMB@zI3DMNpxxv6Til4g1N56QNZltXhOMErQgUHkFz3eE4r5ClK7Ba1LN8) section. – Kelly Bundy Sep 22 '22 at 23:05
  • Yeah, I discounted any approach that requires imports. Anyway, the differences between the other approaches don't seem statistically significant, so it's probably best to just stick with what's most readable. – ekhumoro Sep 23 '22 at 10:56