
Let's say we have a list of N lists. For example:

L = [['A','B','C','D','E'], ['A','B','C'],['B','C','D'],['C','D'],['A','C','D']]

I want to find the longest subsets that are common to more than one of these lists, along with the number of lists that contain each one. In this case:

ans = {'A,B,C':2, 'A,C,D':2, 'B,C,D':2}

I think this question is similar to mine, but I am having a hard time understanding the C# code.

Alex
  • If you have a very large number of lists with many elements, you might want to look at [this related question](https://stackoverflow.com/questions/31735817/algorithm-data-structure-for-largest-set-intersection-in-a-collection-of-sets), which talks about parallelization and other optimizations for this problem. – kcsquared Feb 17 '22 at 17:54

1 Answer


I assume that a "common subset" is a set that is a subset of at least two lists in the array.

With that in mind, here's one solution.

from itertools import combinations
from collections import Counter

L = [['A','B','C','D','E'], ['A','B','C'], ['B','C','D'], ['C','D'], ['A','C','D']]

# Frozensets are hashable, so they can be used as Counter/dict keys below.
L = [*map(frozenset, L)]

# Intersect every pair of lists, then keep only the longest intersections.
# A set comprehension removes duplicate intersections so that no subset is
# counted more than once per containing list.
sets = [l1 & l2 for l1, l2 in combinations(L, 2)]
maxlen = max(len(s) for s in sets)
sets = {s for s in sets if len(s) == maxlen}

# Count how many of the original lists contain each maximal common subset.
count = Counter(s for s in sets for l in L if s <= l)
dic = {','.join(sorted(s)): k for s, k in count.items()}

Resulting dictionary `dic`:

{'A,B,C': 2, 'B,C,D': 2, 'A,C,D': 2}
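
For much larger inputs (the comments below mention ~200K lists of length ~100), the pairwise intersection step is the bottleneck, since it grows quadratically with the number of lists. A minimal, untested sketch of the pruning idea discussed in the comments is to track the best intersection length seen so far and skip any pair whose shorter member cannot reach it; the function name `longest_common_subsets` below is just for illustration:

from itertools import combinations
from collections import Counter

def longest_common_subsets(lists):
    # Convert each list to a frozenset once, up front.
    # Processing longer sets first tends to raise best_len early,
    # which lets the pruning below skip more pairs.
    sets = sorted(map(frozenset, lists), key=len, reverse=True)
    best_len = 0
    candidates = set()
    for s1, s2 in combinations(sets, 2):
        # An intersection can never be larger than the smaller operand,
        # so pairs below the current best length cannot produce a winner.
        if min(len(s1), len(s2)) < best_len:
            continue
        inter = s1 & s2
        if len(inter) > best_len:
            best_len, candidates = len(inter), {inter}
        elif len(inter) == best_len:
            candidates.add(inter)
    count = Counter(c for c in candidates for s in sets if c <= s)
    return {','.join(sorted(c)): k for c, k in count.items()}

print(longest_common_subsets([['A','B','C','D','E'], ['A','B','C'], ['B','C','D'], ['C','D'], ['A','C','D']]))
# {'A,B,C': 2, 'B,C,D': 2, 'A,C,D': 2}  (key order may vary)

This still visits every pair in the worst case, but once a long common subset has been found, many pairs can be skipped without computing their intersections.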
Ben Grossmann
  • Thank you! Would this scale if I have ~200K lists, each of length 100? Making all the combinations explicitly might pose an issue, right? And do you mind commenting on why you used frozenset? – Alex Feb 17 '22 at 16:09
  • Making all the subsets is O(n^2); I'm not sure what that means concretely for ~200K lists. I would say just try it and see if it's taking too long. One optimization in that initial step is to keep track of the max-length subset so that we can skip combinations involving elements like `['C','D']` that are too short to consider. – Ben Grossmann Feb 17 '22 at 16:12
  • Regarding frozensets: sets aren't hashable, so you can't use them as keys for dictionary objects like a Counter. – Ben Grossmann Feb 17 '22 at 16:13
  • PS: Regarding that "optimization", keeping track of the max-length really just amounts to filtering out any elements of L smaller than the second-to-largest element. – Ben Grossmann Feb 17 '22 at 16:21
  • You should precompute `frozenset(l1)` for each `l1`. Right now, that step takes `O(m*L^2)` time, where L is the number of lists and `m` is the max size of a list. – kcsquared Feb 17 '22 at 17:23
  • @kcsquared Do you mind adding the code? I am not sure I understand where/how to precompute `l1`. It seems to me the only way to get it is to go through `combinations(L,2)`. – Alex Feb 17 '22 at 17:35
  • @Alex Added, see my edit. – Ben Grossmann Feb 17 '22 at 17:39
  • `len_thr = sorted(len(l) for l in L)[-2]` -- I don't think this is correct. If you append another list to L like `['F', 'G', 'H', 'I']`, the resulting dictionary is `{'': 2}` – kcsquared Feb 17 '22 at 17:56
  • @kcsquared You're right; I'll just get rid of that then – Ben Grossmann Feb 17 '22 at 18:01
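
To illustrate the point about frozensets raised in the comments above: a plain set is mutable and therefore unhashable, so it cannot be used as a dict or Counter key, while a frozenset can. A quick, purely illustrative check:

from collections import Counter

Counter({frozenset({'A', 'B'}): 2})   # works: frozensets are hashable

try:
    Counter({{'A', 'B'}: 2})          # fails while building the dict literal
except TypeError as e:
    print(e)                          # unhashable type: 'set'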