2

I have a workflow where I process lists and keep track of which ones I've seen in a dictionary, because when I get to one that's been seen already, I can skip it. The caveat is that two lists with the same elements ordered differently are considered duplicates. To take a simple example with integer lists: [3, 1, 5, 6] and [5, 3, 6, 1] are equivalent.

The way I've been doing this is to take tuple(sorted(L)) and use that as the dictionary key. So for the above example, seen looks like:

seen = {..., (1, 3, 5, 6): 1, ...}

This has the nice advantage of constant-time lookup for each list I potentially need to process. The problem, however, is that in order to check whether a given list is in seen, I have to compute tuple(sorted(L)) first. And it turns out that for large amounts of data, this becomes prohibitive, to the point of taking over 50% of the total time of my entire process.
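For concreteness, a minimal sketch of the workflow described above (the function name is illustrative, not from my actual code):

```python
seen = {}

def maybe_process(lst):
    # Canonical key: identical for any ordering of the same elements.
    key = tuple(sorted(lst))  # this sort is the expensive step
    if key in seen:
        return False  # an equivalent list was already handled
    seen[key] = 1
    return True  # caller does the real processing

print(maybe_process([3, 1, 5, 6]))  # True
print(maybe_process([5, 3, 6, 1]))  # False: equivalent to the first
```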

I'd like to make use of collections.Counter somehow, because Counter([3, 1, 5, 6]) and Counter([5, 3, 6, 1]) would compare equal. But Counter objects aren't hashable, so they can't be used as dict keys. How can I keep my dictionary lookup, but without doing the sorting operation indicated above? Thanks in advance.

EDIT: In looking at the suggestions to use frozenset I realized I left an important thing out, which is that elements can duplicate. So while the two lists above compare equal, [3, 1, 5, 6] and [3, 3, 1, 5, 6, 6] need to be seen as distinct. Since the frozenset operation would remove duplicates, it isn't an option here.
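To make that requirement concrete, here's why a plain frozenset key collapses the two lists from the edit while the sorted-tuple key keeps them apart:

```python
a = [3, 1, 5, 6]
b = [3, 3, 1, 5, 6, 6]  # same elements, different multiplicities

print(frozenset(a) == frozenset(b))          # True: duplicates are lost
print(tuple(sorted(a)) == tuple(sorted(b)))  # False: multiplicities preserved
```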

Pat Jones
    Not sure about this at all, but would `frozenset` reduce time consumption? Again, I have no idea about implementation details. – j1-lee Oct 16 '21 at 05:43
  • @j1-lee I suspect it's the sorting that ends up costing time: it involves both memory allocation and an over-linear time complexity. – Vatine Oct 16 '21 at 06:00
  • @Vatine Good to know! Thanks :) – j1-lee Oct 16 '21 at 06:04
  • How large are your lists and Counters? – no comment Oct 16 '21 at 06:15
  • I don't think setting tuples as dict keys is a good idea. Maybe convert them into a string instead? That way, differently ordered tuple strings would be considered non-duplicates. – srdg Oct 16 '21 at 06:18
  • Why don't you add a flag to your dictionary to check if you've already processed your list or not? `{"list1": {"list": [1, 2, 3, 4], "seen": False}, "list2": {"list": [1, 2, 3, 4], "seen": True}}` – Shmack Oct 16 '21 at 07:46
  • @j1-lee 's suggestion is good - instead of using `tuple(sorted(l))` as the key, use `frozenset(l)`. – Stef Oct 16 '21 at 10:44
  • That would work great but I neglected to mention an important requirement which I put in an edit to my post. ```frozenset``` would remove duplicates, which I need to keep. Sorry that I got wrapped up in writing the post and gave a poor example. – Pat Jones Oct 16 '21 at 16:02
  • Looks like you overlooked my question: How large are your lists and Counters? That would be good to know in order to know what's the appropriate way forward. In particular, I asked that in order to know *how much* duplication you have. *That* you have duplication was already clear from you considering Counter instead of set. – no comment Oct 16 '21 at 17:26
  • @don'ttalkjustcode Yes of course, that's a consideration. In the most extreme scenario I could have somewhere around 300,000 lists with anywhere from a few to many dozen elements. It's hard to be more precise because I don't know the full set at the start of execution. But, those are the order of magnitude numbers. – Pat Jones Oct 16 '21 at 22:42
  • @PatJones That's helpful, and would be good to include in the question. Although still unclear how large the Counters are, i.e., how much duplication there is. – no comment Oct 16 '21 at 23:31

2 Answers

0

As others have pointed out, frozenset is your best option:

seen = set()

list_1, list_2, list_3 = [3, 1, 5, 6], [5, 3, 6, 1], [1, 2, 3]

print(frozenset(list_1) in seen)  # False: not seen yet
seen.add(frozenset(list_1))
print(frozenset(list_1) in seen)  # True
print(frozenset(list_2) in seen)  # True: same elements as list_1
print(frozenset(list_3) in seen)  # False

frozenset takes an iterable as its argument, so instantiating the set is O(n). Lookups are O(1) on average, the same as for a standard set, according to this previous response. Either way, you save the very expensive sort, which is O(n log n).
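A quick way to check that claim on your own data, sketched with timeit (the sample list here is arbitrary; substitute a representative one, and expect the numbers to vary by machine):

```python
import timeit

data = list(range(60))[::-1]  # stand-in for one of your lists

t_sorted = timeit.timeit(lambda: tuple(sorted(data)), number=100_000)
t_frozen = timeit.timeit(lambda: frozenset(data), number=100_000)
print(f"tuple(sorted(L)): {t_sorted:.3f}s  frozenset(L): {t_frozen:.3f}s")
```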

bvan
  • thanks for the suggestion. That would work great but I neglected to mention an important requirement which I put in an edit to my post. ```frozenset``` would remove duplicates, which I need to keep. – Pat Jones Oct 16 '21 at 16:00
  • That sounds wrong, you can't really consider the set size non-constant one time and then constant another time. – no comment Oct 16 '21 at 17:37
0

If you do have a lot of duplicates, then maybe use a frozenset of Counter items?

>>> from collections import Counter
>>> frozenset(Counter([3, 3, 1, 5, 6, 6]).items())
frozenset({(1, 1), (3, 2), (6, 2), (5, 1)})

Or sorted tuples of Counter items:

>>> tuple(sorted(Counter([3, 3, 1, 5, 6, 6]).items()))
((1, 1), (3, 2), (5, 1), (6, 2)) 
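A sketch of how the frozenset-of-items key could plug into the seen set from the question (the helper name is illustrative):

```python
from collections import Counter

def multiset_key(lst):
    # Hashable key that preserves element multiplicities.
    return frozenset(Counter(lst).items())

seen = set()
for lst in ([3, 1, 5, 6], [5, 3, 6, 1], [3, 3, 1, 5, 6, 6]):
    key = multiset_key(lst)
    if key in seen:
        print("duplicate:", lst)
    else:
        seen.add(key)
        print("new:", lst)
# new: [3, 1, 5, 6]
# duplicate: [5, 3, 6, 1]
# new: [3, 3, 1, 5, 6, 6]
```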
no comment
  • Both of these actually increase my execution time substantially. I don't know enough about ```frozenset``` and ```Counter``` to say if that's surprising. As it turns out, my larger process has inefficiencies needing to be addressed, so what I'm trying to do here might be moot but is still an interesting question regardless. – Pat Jones Oct 18 '21 at 16:20