
Let's say we have a list of N lists. For example:

L = [['A','B','C','D','E'], ['A','B','C'],['B','C','D'],['C','D'],['A','C','D']]

I want to find the longest subsets that are common to more than one of these lists, along with the number of lists that contain each one. In this case:

ans = {'A,B,C':2, 'A,C,D':2, 'B,C,D':2}

I think this question is similar to mine, but I am having a hard time understanding the C# code.

Alex
  • If you have a very large number of lists with many elements, you might want to look at [this related question](https://stackoverflow.com/questions/31735817/algorithm-data-structure-for-largest-set-intersection-in-a-collection-of-sets), which talks about parallelization and other optimizations for this problem. – kcsquared Feb 17 '22 at 17:54

1 Answer


I assume that a "common subset" is a set that is a subset of at least two lists in the array.

With that in mind, here's one solution.

from itertools import combinations
from collections import Counter

L = [['A','B','C','D','E'], ['A','B','C'], ['B','C','D'], ['C','D'], ['A','C','D']]

# Frozensets are hashable, so they can be used as Counter/dict keys below.
L = [*map(frozenset, L)]

# Intersect every pair of lists, then keep only the longest intersections.
# A set comprehension removes duplicate intersections so that no subset is
# counted more than once per containing list.
sets = [l1 & l2 for l1, l2 in combinations(L, 2)]
maxlen = max(len(s) for s in sets)
sets = {s for s in sets if len(s) == maxlen}

# Count how many of the original lists contain each maximal common subset.
count = Counter(s for s in sets for l in L if s <= l)
dic = {','.join(sorted(s)): k for s, k in count.items()}

Resulting dictionary `dic`:

{'A,B,C': 2, 'B,C,D': 2, 'A,C,D': 2}
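
For much larger inputs (the comments below mention ~200K lists of length ~100), the pairwise intersection step is the bottleneck, since it grows quadratically with the number of lists. A minimal, untested sketch of the pruning idea discussed in the comments is to track the best intersection length seen so far and skip any pair whose shorter member cannot reach it; the function name `longest_common_subsets` below is just for illustration:

from itertools import combinations
from collections import Counter

def longest_common_subsets(lists):
    # Convert each list to a frozenset once, up front.
    # Processing longer sets first tends to raise best_len early,
    # which lets the pruning below skip more pairs.
    sets = sorted(map(frozenset, lists), key=len, reverse=True)
    best_len = 0
    candidates = set()
    for s1, s2 in combinations(sets, 2):
        # An intersection can never be larger than the smaller operand,
        # so pairs below the current best length cannot produce a winner.
        if min(len(s1), len(s2)) < best_len:
            continue
        inter = s1 & s2
        if len(inter) > best_len:
            best_len, candidates = len(inter), {inter}
        elif len(inter) == best_len:
            candidates.add(inter)
    count = Counter(c for c in candidates for s in sets if c <= s)
    return {','.join(sorted(c)): k for c, k in count.items()}

print(longest_common_subsets([['A','B','C','D','E'], ['A','B','C'], ['B','C','D'], ['C','D'], ['A','C','D']]))
# {'A,B,C': 2, 'B,C,D': 2, 'A,C,D': 2}  (key order may vary)

This still visits every pair in the worst case, but once a long common subset has been found, many pairs can be skipped without computing their intersections.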
Ben Grossmann
  • Thank you! Would this scale if I have ~200K lists, each of length 100? Making all the combinations explicitly might pose an issue, right? And do you mind commenting on why you used frozenset? – Alex Feb 17 '22 at 16:09
  • Making all the subsets is O(n^2); I'm not sure what that means concretely for ~200K lists. I would say just try it and see if it's taking too long. One optimization in that initial step is to keep track of the max-length subset so that we can skip combinations involving elements like `['C','D']` that are too short to consider. – Ben Grossmann Feb 17 '22 at 16:12
  • Regarding frozensets: sets aren't hashable, so you can't use them as keys for dictionary objects like a Counter. – Ben Grossmann Feb 17 '22 at 16:13
  • PS: Regarding that "optimization", keeping track of the max-length really just amounts to filtering out any elements of L smaller than the second-to-largest element. – Ben Grossmann Feb 17 '22 at 16:21
  • You should precompute `frozenset(l1)` for each `l1`. Right now, that step takes `O(m*L^2)` time, where L is the number of lists and `m` is the max size of a list. – kcsquared Feb 17 '22 at 17:23
  • @kcsquared Do you mind adding the code? I am not sure I understand where/how to precompute `l1`. It seems to me the only way to get it is to go through `combinations(L,2)`. – Alex Feb 17 '22 at 17:35
  • @Alex Added, see my edit. – Ben Grossmann Feb 17 '22 at 17:39
  • `len_thr = sorted(len(l) for l in L)[-2]` -- I don't think this is correct. If you append another list to L like `['F', 'G', 'H', 'I']`, the resulting dictionary is `{'': 2}` – kcsquared Feb 17 '22 at 17:56
  • @kcsquared You're right; I'll just get rid of that then – Ben Grossmann Feb 17 '22 at 18:01
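
To illustrate the point about frozensets raised in the comments above: a plain set is mutable and therefore unhashable, so it cannot be used as a dict or Counter key, while a frozenset can. A quick, purely illustrative check:

from collections import Counter

Counter({frozenset({'A', 'B'}): 2})   # works: frozensets are hashable

try:
    Counter({{'A', 'B'}: 2})          # fails while building the dict literal
except TypeError as e:
    print(e)                          # unhashable type: 'set'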