1

I have a list of items. I also have another list of items (subset of original list) I want to be removed from this list.

myitems = [1, 1, 2, 3, 3, 4, 5]
items_to_remove = [1, 4]

The output of this should be [2, 3, 3, 5]

What is the most efficient way all items from items_to_remove from myitems?

My current code is:

for item in items_to_remove:
    myitems = list(filter((item).__ne__,myitems)

Because my actual use case has lots of items to be removed I am trying to find a more efficient way to do this.

Bijan
  • 7,737
  • 18
  • 89
  • 149
  • Does ``items_to_remove`` have to be a ``list``? Can it be turned into a ``set``? Does ``myitems`` need to be modified inplace? – MisterMiyagi Sep 15 '20 at 20:24
  • 1
    already answered in https://stackoverflow.com/questions/3428536/python-list-subtraction-operation – Fabricio Fonseca Sep 15 '20 at 20:30
  • Actually, the duplicate answers pretty much ignore performance, which I found surprising. They're fine if the lists are short, but if both lists are long, then those answers are unacceptable. See my answer for the most efficient solution when both lists are long. – Tom Karzes Sep 15 '20 at 21:03
  • @TomKarzes There are various performance considerations in that dupe. Specifically [this answer](https://stackoverflow.com/a/48038480/5349916) with extensive timings, including the very same approach shown in your answer. – MisterMiyagi Sep 15 '20 at 21:09
  • 2
    @MisterMiyagi Yes, you're right. I guess my point was that the answers with the most votes (by a considerable margin) ignored performance). – Tom Karzes Sep 15 '20 at 21:10
  • @TomKarzes And now we have *yet another* not-highest-voted answer that foregoes actual performance considerations, and is even more difficult to find. – MisterMiyagi Sep 15 '20 at 21:13
  • @MisterMiyagi Are you talking about my answer? That's about as fast as it can be done for large lists. Also, this question differs from the others in that it specifically asks for an efficient solution, while the others were just looking for any solution. This question specifically asks for an efficient solution for large lists. It's a different question. – Tom Karzes Sep 15 '20 at 21:17
  • @TomKarzes The other question specifically asks "What is a pythonic *and efficient way* of doing this?". Emphasis mine. – MisterMiyagi Sep 15 '20 at 21:22
  • @MisterMiyagi True, although it doesn't specify that large lists are the focus, as opposed to many instances with short lists (for which the non-set solution could actually be faster). If one of the lists is very short (e.g. 2-3 elements), and if the operation is being performed many times on different short lists, then creating a set would actually slow it down. But for large lists, creating a set makes a tremendous difference. – Tom Karzes Sep 15 '20 at 21:28
  • @MisterMiyagi: For my application, it doesn't matter if it's a set or list because `items_to_remove` is unique. – Bijan Sep 15 '20 at 22:48

1 Answers1

2

The most efficient way is to create a set of the items to be removed, then use the set to filter the first list:

s = set(items_to_remove)
result = [x for x in myitems if x not in s]

With the sample list values, this produces the desired result:

[2, 3, 3, 5]

This solution has O(l1+l2) time complexity, where l1 and l2 are the two list lengths.

Note that some of the answers in the duplicate posts skipped the set creation, and just tested for membership directly in the second list. While correct, this has a serious negative impact on performance if the second list is large, with the performance being O(l1*l2) where l1 and l2 are the two list lengths. So unless the second list is very small, you definitely want to convert it to a set first.

Tom Karzes
  • 22,815
  • 2
  • 22
  • 41
  • It might be worth pointing out some practical performance considerations. On the one hand, ``list`` membership testing is O(n) but uses only ``__eq__``, whereas ``set`` is O(1) but uses both ``__hash__`` and ``__eq__`` – this can be relevant for userdefined types with custom ``__hash__`` and ``__eq__``. On the other hand, builtin types (specifically "small" integers) have extremely efficient ``__hash__``/``__eq__`` implementations, making the ``set`` superior for just two items already. – MisterMiyagi Sep 15 '20 at 21:32
  • @MisterMiyagi Yes, for small lists secondary considerations can become quite important. But for large lists, the membership test will dominate the performance, for which you want some kind of hashing scheme, such as sets provide. The focus of this answer is for large lists. – Tom Karzes Sep 15 '20 at 21:34