
I am writing a function that takes a large list of strings and classifies them.

Each string, once classified, is deleted from the list.

Sometimes the function needs to delete a string at an arbitrary index, and sometimes it needs to delete the string at the leftmost position.

About 2/3 of the time it deletes from the leftmost position, and about 1/3 of the time from an arbitrary place in the middle.

I know that using a list as the data structure to hold the strings makes deleting the leftmost cell O(n) and deleting the i'th cell O(n-i), since every cell to the right of the deleted one is moved one step left on each deletion.

I then tried using collections.deque instead of list, but it actually took more time (about 4/3 as much) than the original list, possibly because of the deletions at arbitrary positions.

What data structure will perform best under these assumptions? Is there anything better than just using a list?

With the list, I delete from the leftmost cell using `list.pop(0)` and remove from an arbitrary index by value using `list.remove(value)`.
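For reference, a minimal sketch of the two deletion operations on a toy list (the strings and the removed value are illustrative):

```python
strings = ["apple", "banana", "cherry", "date", "elderberry", "fig"]

# ~2/3 of deletions: remove the leftmost string.
# pop(0) shifts every remaining element one step left, so it is O(n).
leftmost = strings.pop(0)

# ~1/3 of deletions: a related string is removed by value.
# remove() first scans for the value (O(n)) and then shifts the tail left.
strings.remove("date")

print(strings)  # ['banana', 'cherry', 'elderberry', 'fig']
```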

  • show us exactly how you are using the deque – juanpa.arrivillaga Sep 01 '21 at 16:54
  • How large is the list, and how do you find those arbitrary indexes? – no comment Sep 01 '21 at 16:54
  • Just using `leftpop()` for deleting from left and `remove(index)` for deleting in arbitrary index – user183748292 Sep 01 '21 at 16:55
  • Perhaps reverse your list (and adjust the algorithm accordingly) so you delete from the right instead? – no comment Sep 01 '21 at 16:55
  • Or perhaps do lazy deletions, i.e., instead of deleting at arbitrary indexes, replace the value with `None` and delete it when you come across it later (when it's the last list element). We could help you better if you told us more about your overall procedure. – no comment Sep 01 '21 at 16:59
  • @don'ttalkjustcode ~ one million strings, the procedure can sometimes classify a string and then deduce the classification of other related strings, hence deleting them by value from the list – user183748292 Sep 01 '21 at 17:04
  • So when you delete them "by value", how do you know where they are? – no comment Sep 01 '21 at 17:06
  • @don'ttalkjustcode These are the arbitrary position deletions – user183748292 Sep 01 '21 at 17:10
  • That doesn't answer the question. My point is that slowness might rather be a result of how you find the deletion indexes instead of the actual deletions. – no comment Sep 01 '21 at 17:11
  • I just use `list.remove(value)` – user183748292 Sep 01 '21 at 17:13
  • So exactly what I thought. – no comment Sep 01 '21 at 17:13
  • Yes, the first comment was misleading. I edited the post and added this description. – user183748292 Sep 01 '21 at 17:24
  • I don't have time now and would like to know more to proceed anyway, but have a look at this [demo](https://tio.run/##pY7LCsIwEEX3@YrZSNJaSlUUKfRLxEVqkzaQF9NR8OtjqF240oXLe@dc5sQnTcEfzhFT0hgckHHKEBgXA9KaGJPQwYXrEPgVStg1ZXmCbW56iWvTMMrMmxdWun6QLcgalQsPJRawqMDfXa@wy3ixrFhE40nwBTN@BJl/TgrQjBO1vAK@OdZ7zWEDlIMz1ppZ3YIfZl78oRVDFM0Xn3yPHzpWaWoBfvik9AI). – no comment Sep 01 '21 at 17:27
  • ```Deque``` is suitable for this case. I don't understand why it's slower in your case. – Ram Sep 01 '21 at 17:28
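The lazy-deletion idea from the comments can be sketched like this (a sketch only, assuming unique strings; the helper names are illustrative):

```python
strings = ["a", "b", "c", "d", "e"]
removed = set()  # values deleted "by value" but not yet physically removed
i = 0            # index of the current leftmost live element

def pop_leftmost():
    global i
    # Skip entries that were lazily removed earlier.
    while strings[i] in removed:
        removed.discard(strings[i])
        i += 1
    value = strings[i]
    i += 1
    return value

def remove_value(value):
    removed.add(value)  # O(1): defer the real deletion

remove_value("b")
pop_leftmost()  # 'a'
pop_leftmost()  # 'c' ('b' is skipped)
```

Since nothing is ever shifted, both operations run in amortized O(1).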

2 Answers

1

If you know exactly where you want to delete, you might be looking for a linked list implementation.

Do note that deleting a node from a linked list is O(1) once you have a reference to that node.

However, searching for an element is O(n).

EDIT: If you instead care about the values to be deleted, a set is probably what you're looking for. Do note that a set is an unordered data structure and only allows deleting by value.
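Combining this with the dict idea from the comments, a minimal sketch (assuming unique strings; the class and method names are illustrative) might look like:

```python
class Node:
    __slots__ = ("value", "prev", "next")

    def __init__(self, value):
        self.value = value
        self.prev = self.next = None

class LinkedStrings:
    """Doubly linked list plus a dict from value to node, so both
    leftmost deletion and deletion by value are O(1)."""

    def __init__(self, strings):
        self.head = self.tail = None
        self.nodes = {}
        for s in strings:
            node = Node(s)
            self.nodes[s] = node
            if self.tail is None:
                self.head = self.tail = node
            else:
                node.prev = self.tail
                self.tail.next = node
                self.tail = node

    def _unlink(self, node):
        if node.prev:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next:
            node.next.prev = node.prev
        else:
            self.tail = node.prev

    def popleft(self):
        node = self.head
        self._unlink(node)
        del self.nodes[node.value]
        return node.value

    def remove(self, value):
        # The dict lookup replaces the O(n) search done by list.remove().
        self._unlink(self.nodes.pop(value))

ls = LinkedStrings(["a", "b", "c", "d"])
ls.popleft()    # 'a'
ls.remove("c")  # no scan needed
```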

Sai Prashanth
  • Depends on how they find the deletion index. If they actually don't come from an index but from a *value* to delete, they could use a dict mapping the values to linked list nodes. Then "searching" should be O(1) as well. – no comment Sep 01 '21 at 17:01
  • but I believe that using a dict is O(log n) though – Sai Prashanth Sep 01 '21 at 17:02
  • No, it's O(1), except for very special cases (like [this](https://stackoverflow.com/q/69017358/16759116) but for dict). Especially for strings, as there's randomization involved so it's hard to make it slow even if you try. – no comment Sep 01 '21 at 17:04
1

Try an OrderedDict with your strings as keys.

Benchmark with just 30,000 strings, 20,000 deletions from left and 10,000 "arbitrary" removals:

1476 ms  1490 ms  1491 ms  using_list
  10 ms    11 ms    11 ms  using_OrderedDict

Instead of

strings_list.pop(0)
strings_list.remove(value)

use

strings_dict.popitem(False)
strings_dict.pop(value)

This assumes you have no duplicate strings, which seems likely. If you do, then use the dictionary values as the frequencies.
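For the duplicate case, the frequency idea can be sketched as follows (helper names are illustrative; note that duplicates collapse onto the position of their first occurrence, so the "leftmost" semantics differ slightly from a plain list):

```python
from collections import OrderedDict

strings = ["a", "b", "a", "c", "b"]

# One key per distinct string, mapped to its number of occurrences.
counts = OrderedDict()
for s in strings:
    counts[s] = counts.get(s, 0) + 1

def pop_leftmost(counts):
    # Decrement the first key's count; drop the key when it hits zero.
    key = next(iter(counts))
    counts[key] -= 1
    if counts[key] == 0:
        del counts[key]
    return key

def remove_value(counts, value):
    counts[value] -= 1
    if counts[value] == 0:
        del counts[value]

pop_leftmost(counts)       # 'a' (one 'a' still remains)
remove_value(counts, "b")
```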

Benchmark code (Try it online!):

from timeit import timeit
from collections import OrderedDict
from random import randrange

def using_list(strings, removes):
    for string in removes:
        # Two leftmost deletions per one arbitrary removal (the 2:1 ratio)
        strings.pop(0)
        strings.pop(0)
        strings.remove(string)

def using_OrderedDict(strings, removes):
    strings = OrderedDict.fromkeys(strings)
    for string in removes:
        # popitem(last=False) removes the first (leftmost) item
        strings.popitem(False)
        strings.popitem(False)
        strings.pop(string)

# Build testcase with 30,000 elements
strings = list(map(str, range(10_000, 40_000)))
copy = strings.copy()
removes = []
for _ in range(10_000):
    copy.pop(0)
    copy.pop(0)
    i = randrange(len(copy))
    removes.append(copy.pop(i))

# Benchmark
for _ in range(3):
    for func in using_list, using_OrderedDict:
        times = []
        for _ in range(3):
            copy = strings.copy()
            t = timeit(lambda: func(copy, removes), number=1)
            times.append(t)
        times.sort()
        print(*('%4d ms ' % (t * 1e3) for t in times), func.__name__)
    print()
no comment