
I need an efficient Python algorithm to get to the following outcome(s):

example 1:

l = [[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]

r = [[1, 2, 4], [3]]

example 2:

l = [[1, 2, 3, 4], [2, 1], [3, 1], [4, 1]]

r = [[1, 2], [3], [4]]

example 3:

l = [[1], [2, 3, 4], [3, 2, 5, 6], [4, 2] , [5, 3], [6, 3]]

r = [[2, 3], [1], [4], [5], [6]]

In these examples l is a list of lists, where:

  • the first element of each list is always the index of the element
  • if the list for element 1 contains the value 3, then in turn the 3rd list will always contain the 1 itself.
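This symmetry invariant can be sanity-checked with a short snippet (a sketch; `check_symmetric` is a made-up helper name):

```python
def check_symmetric(l):
    # map each index to the set of values in its list (excluding the index itself)
    neighbours = {sub[0]: set(sub[1:]) for sub in l}
    # if list 1 contains 3, then list 3 must contain 1
    return all(b in neighbours and a in neighbours[b]
               for a, nbrs in neighbours.items() for b in nbrs)

l = [[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]
print(check_symmetric(l))  # True
```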

What I need is to find an efficient way to group matching subsets using 3 criteria: the length of the subset, the frequency of its sub-elements, and the fact that a sub-element can only be part of one resulting subset.

In the 1st example the sub-element [1] would be the most frequent subset, but the fact that the longer subset [1, 2, 4] was found in 3 of the 4 elements is more important: length trumps frequency.

In the 2nd example a grouping such as [1, 3] has the same length and frequency as [1, 2], but the 1 is "taken" by the first-found subset.

Later edit:

What I did so far is:

  • I turned my list of lists into a dictionary
  • then I built a function which repeatedly builds square matrices of matched and unmatched values, based on all possible permutations of the unique keys of my dictionary
  • then in the square matrices I search for the largest squares along the main diagonal (based on code provided here: Python find the largest square in the matrix dynamic programming)
  • then I eliminate the largest squares which overlap and start all over again

My code is totally inefficient because the number of permutations grows factorially (n! for n keys) with the size of my initial dictionary, therefore I am looking for a new idea, a new approach.
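For a sense of scale, the number of permutations of n dictionary keys is n!, which blows up very quickly; a quick stdlib check:

```python
from math import factorial

# number of key permutations the current approach has to iterate over
for n in (4, 8, 10, 12):
    print(n, factorial(n))
# 4 24
# 8 40320
# 10 3628800
# 12 479001600
```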

Here is what I have done so far:


from itertools import chain, permutations

def group_matches(my_dict, matched=None):

    def update_my_dict(my_dict, matched):
        ret_val = {}
        for k, v in my_dict.items():
            if k not in matched:
                for unique_ind in matched:
                    if unique_ind in v:
                        v.remove(unique_ind)
                ret_val[k] = v
        return ret_val

    def get_matches(unique_ind_permutation, my_dict):
        def create_matrix(unique_ind_permutation, my_dict):
            matrix = []
            for k in unique_ind_permutation:
                r = [True if f in my_dict[k] else False
                     for f in unique_ind_permutation]
                matrix += [r]
            return matrix
        matrix = create_matrix(unique_ind_permutation, my_dict)
        dp = [[0] * len(matrix) for _ in range(len(matrix))]
        max_squares = [(0, None, None)]
        for ri, r in enumerate(matrix):
            for ci, c in enumerate(r):
                dp[ri][ci] = 0 if not c \
                    else (1 if ri == 0 or ci == 0
                    else min(dp[ri - 1][ci], dp[ri][ci - 1], dp[ri - 1][ci - 1]) + 1)
                max_squares = [(dp[ri][ci], ri, ci)] if dp[ri][ci] > max_squares[0][0] \
                    else (max_squares + [(dp[ri][ci], ri, ci)] if dp[ri][ci] == max_squares[0][0]
                    else max_squares)
        matches = []
        if max_squares[0][0] != 0:
            for max_square in max_squares:
                rows = [r for r in range(max_square[1]+1-max_square[0],max_square[1]+1)]
                columns = [c for c in range(max_square[2]+1-max_square[0],max_square[2]+1)]
                if rows == columns:
                    matches += [tuple(rows)]
            matches = eliminate_common_matches(matches)
        matches_to_unique_ind = []
        l = 0
        if len(matches) > 0:
            l = len(matches[0])
            for m in matches:
                m_unique_ind = sorted([unique_ind_permutation[x] for x in m])
                matches_to_unique_ind += [m_unique_ind]
        return matches_to_unique_ind, l

    def eliminate_common_matches(matches):
        for m in matches:
            aux = matches.copy()
            aux.remove(m)
            for a in aux:
                common = (set(m) & set(a))
                if len(common) > 0:
                    min_m = min(m)
                    min_a = min(a)
                    if min_m <= min_a:
                        matches.remove(a)
                    else:
                        matches.remove(m)
        return matches

    def find_matched(unique_indexes, matches):
        matched = []
        unmatched = []
        for unique_ind in unique_indexes:
            for m in matches:
                if unique_ind in m:
                    matched += [unique_ind]
                else:
                    unmatched += [unique_ind]
        return matched, unmatched

    if matched is not None:
        my_dict = update_my_dict(my_dict, matched)
    unique_indexes = list(my_dict.keys())
    p_unique_indexes = list(permutations(unique_indexes))
    matches = []
    last_l = 0
    for p in p_unique_indexes:
        m, l = get_matches(p, my_dict)
        if last_l < l:
            matches.clear()
            last_l = l
        if last_l == l and l > 0:
            matches += m
    matches = set(tuple(l) for l in matches)
    matches_order = []
    for m in matches:
        mx = sorted([unique_indexes.index(unique_ind_x) for unique_ind_x in m])
        matches_order += [mx]
    matches_order = eliminate_common_matches(matches_order)
    matches = []
    for mo in matches_order:
        mx = [unique_indexes[x] for x in mo]
        matches += [mx]
    matched, unmatched = find_matched(unique_indexes, matches)
    return matches, matched, unmatched

my_dict = {1:[1, 2, 3, 4],
           2:[2, 1, 4],
           3:[3, 1],
           4:[4, 1, 2]}     
unique_indexes = list(my_dict.keys())
matches = []
matched = None
while True:
    instance_matches, matched, unmatched = group_matches(my_dict, matched)
    if len(instance_matches) > 0:
        matches += instance_matches
    if len(unmatched) == 0 or len(instance_matches) == 0:
        break
unmatched = list(set(unique_indexes) - set(list(chain(*matches))))
matches_unique = []
for i, x in enumerate(matches):
    if x not in matches[:i]:
        matches_unique += [x]
matches = matches_unique + unmatched      
print(matches)

Another more complicated example:


my_dict = {
            'a':['a', 'b', 'c', 'h'],
            'b':['b', 'a', 'c', 'i'],
            'c':['c', 'a', 'b', 'd', 'e'],
            'd':['d', 'c', 'e', 'f'],
            'e':['e', 'c', 'd', 'f'],
            'f':['f', 'd', 'e', 'g', 'h'],
            'g':['g', 'f', 'h'],
            'h':['h', 'a', 'f', 'g'],
            'i':['i',  'b']
          }
# expected outcome:
res = [['c', 'd', 'e'], ['f', 'g', 'h'], ['a', 'b'], ['i']]

The subset ['d', 'e', 'f'] is not part of the expected outcome, because the 'd' and the 'e' are already taken by the first subset.

Sandy B
    Where is your code? What have you tried until now? @Sandy B – Vishnudev Krishnadas Jun 18 '21 at 06:36
  • If your inputs are always short as in your examples, there is not much of a need for "an efficient algorithm". If you expect to have large inputs, consider using something other than Python, because it is not very performant. – Lenormju Jun 18 '21 at 13:41
  • Yes, my inputs can get much longer. Any suggestions on what else I could use? Thanks. – Sandy B Jun 18 '21 at 13:47
  • I'm working on a solution. For the first example `l = [[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]`, why the answer should be `r = [[1, 2, 4], [3]]` instead of `r = [[1, 2, 3, 4]]`, `[1, 2, 3, 4]` being a longer subsequence than `[1, 2, 4]` ? – Lenormju Jun 18 '21 at 14:43
  • Because in the result: 1. Any sub-element must not appear more than once: a result such as [[1, 2, 3, 4], [1, 2, 3, 4]] would mean for example that the 1, the 2 and so on are appearing twice overall. 2. I am not interested in the longest sub-sequence, but rather in searching for the longest, most frequent sub-sequence, with the condition that length trumps frequency. – Sandy B Jun 18 '21 at 15:50

1 Answer


You can view your lists/sets problem as a graph problem, by considering each set as a node and each value it contains as an edge to the node with that index. This works because:

  • if element 1 contains the value 3, then in its turn the 3rd list will always contain the 1 itself.

    means that all edges are bidirectional

  • the first element of each list is always the index of the element

    means that each node is connected to itself

As an example, your first input was [[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]], so the corresponding graph is:

┌───┐         ┌───┐
│   │         │   │
└─set 1─────set 2─┘
    │  \      │
    │   \     │
    │    \    │
    │     \   │
    │      \  │
    │       \ │
┌─set 3     set 4─┐
│   │         │   │
└───┘         └───┘

From this viewpoint, "finding the largest common subset of these sets" can be translated to "finding the largest subgraph in which all nodes are connected to each other", which is also called the maximum clique problem.

Indeed, the maximum clique for this graph is {1, 2, 4}.
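The claim can be checked by brute force on this small graph (a sketch; exhaustive search over all node subsets is only viable for tiny inputs):

```python
from itertools import combinations

l = [[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]
neighbours = {sub[0]: set(sub) for sub in l}  # each node is connected to itself

def is_clique(nodes):
    # every pair of nodes in the subset must be connected
    return all(b in neighbours[a] for a, b in combinations(nodes, 2))

nodes = list(neighbours)
best = max((set(c) for r in range(1, len(nodes) + 1)
            for c in combinations(nodes, r) if is_clique(c)),
           key=len)
print(best)  # {1, 2, 4}
```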

Based on this approach, here is a working solution:

from typing import List, Set, Tuple


def get_maximum_clique(edges: Set[Tuple[int, int]], vertices_number: Set[int]) -> Set[int]:
    biggest_maximal_clique_so_far = set()  # trivial (empty) solution
    # to find the maximum clique, we start by iterating over each vertex (which by itself is already a clique)
    for seed_vertex in vertices_number:
        # and we try to grow the clique by adding a new vertex iff it is connected to every vertex of the clique
        # until we exhausted every possible vertex to add
        candidate_vertices = vertices_number.difference({seed_vertex})
        clique = {seed_vertex}
        while candidate_vertices:  # is not empty
            candidate_vertex = candidate_vertices.pop()
            # if the candidate vertex is connected to all the vertices already in the clique
            if all((vertex, candidate_vertex) in edges for vertex in clique):
                # grow the clique
                clique.update({candidate_vertex})
        # we grew the maximal clique from this seed, let's check its size
        if not biggest_maximal_clique_so_far or len(clique) > len(biggest_maximal_clique_so_far):
            biggest_maximal_clique_so_far = clique

    maximum_clique = biggest_maximal_clique_so_far
    return maximum_clique


def get_largest_most_frequent_subsets(list_of_lists: List[List[int]]) -> List[List[int]]:
    # we prepare a list of all vertices of the graph
    all_vertices_number = set(connected_vertex for vertex_edges in list_of_lists for connected_vertex in vertex_edges)

    connected_vertices_of_each_vertex = list_of_lists  # simple renaming

    decreasing_maximum_cliques = []
    while any(connected_vertices_of_each_vertex):  # contains a non-empty subset i.e. a vertex with remaining edges
        # we create our graph as a set of edges of the form (v1, v2)
        edges = {
            (vertex1, vertex2)
            for vertex1, *edges in (connected_vertices for connected_vertices in connected_vertices_of_each_vertex
                                    if len(connected_vertices) >= 2)
            for vertex2 in edges
            # we could add `if vertex1 < vertex2` to halve the number of directional edges because all edges are
            # bidirectional, but I prefer not to worry with the order later
        }

        # we get the clique for this graph
        maximum_clique_remaining = get_maximum_clique(edges, all_vertices_number)
        decreasing_maximum_cliques.append(maximum_clique_remaining)
        # we remove the edges from this clique to all vertices left,
        # and all edges from other vertices to those of the clique
        connected_vertices_of_each_vertex = [
            [
                connected_vertex for connected_vertex in connected_vertices
                if connected_vertex not in maximum_clique_remaining
            ]
            if (connected_vertices and connected_vertices[0] not in maximum_clique_remaining) else []
            for connected_vertices in connected_vertices_of_each_vertex
        ]
        all_vertices_number.difference_update(maximum_clique_remaining)

    return [list(sorted(clique)) for clique in decreasing_maximum_cliques]  # serves only to match the expected output


def test__get_largest_most_frequent_subsets() -> None:
    assert get_largest_most_frequent_subsets([[1, 2, 3, 4], [2, 1, 4], [3, 1], [4, 1, 2]]) == [[1, 2, 4], [3]]
    assert get_largest_most_frequent_subsets([[1, 2, 3, 4], [2, 1], [3, 1], [4, 1]]) == [[1, 2], [3], [4]]
    assert get_largest_most_frequent_subsets([[1], [2, 3, 4], [3, 2, 5, 6], [4, 2], [5, 3], [6, 3]]) == [[2, 3], [1], [4], [5], [6]]

    my_dict = {
        'a': ['a', 'b', 'c', 'h'],
        'b': ['b', 'a', 'c', 'i'],
        'c': ['c', 'a', 'b', 'd', 'e'],
        'd': ['d', 'c', 'e', 'f'],
        'e': ['e', 'c', 'd', 'f'],
        'f': ['f', 'd', 'e', 'g', 'h'],
        'g': ['g', 'f', 'h'],
        'h': ['h', 'a', 'f', 'g'],
        'i': ['i', 'b']
    }
    my_list_of_lists = list(values for key, values in sorted(my_dict.items()))
    result = get_largest_most_frequent_subsets(my_list_of_lists)
    # there are multiple combinations of decreasing maximum cliques :
    assert (result == [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h'], ['i']]) \
           or (result == [['a', 'b', 'c'], ['f', 'g', 'h'], ['d', 'e'], ['i']]) \
           or (result == [['c', 'd', 'e'], ['f', 'g', 'h'], ['a', 'b'], ['i']]) \
           or (result == [['c', 'd', 'e'], ['f', 'g', 'h'], ['b', 'i'], ['a']]) \
           or (result == [['d', 'e', 'f'], ['a', 'b', 'c'], ['g', 'h'], ['i']]) \
           or (result == [['f', 'g', 'h'], ['a', 'b', 'c'], ['d', 'e'], ['i']]) \
           or (result == [['f', 'g', 'h'], ['c', 'd', 'e'], ['b', 'i'], ['a']]), repr(result)


if __name__ == "__main__":
    test__get_largest_most_frequent_subsets()

My code passes your 4 tests, and it is fast:

if __name__ == "__main__":
    import timeit
    print(timeit.timeit("test__get_largest_most_frequent_subsets()", number=1000, globals=locals()))
    # 0.1920557

If it is not fast enough for you (you did not ask for a specific duration), there may be ways to optimize this code; I did not try to profile it.
I implemented a greedy algorithm, but there are some clever algorithms listed on the Wikipedia page with a lower computational complexity, which I did not try to implement. There already exist some performant implementations for the maximum clique(s) problem; for example, this answer indicates there is one in non-parallel C++/Boost for the weighted maximum clique (of which your problem is a special case, all weights being equal).
But I don't think there already exists a performant version of what you seek (which I dubbed "decreasing maximum cliques").
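For reference, the greedy clique growth above is not guaranteed to find the true maximum clique on every graph; an exact (exponential worst-case) alternative is the Bron–Kerbosch algorithm. A minimal stdlib sketch, without the usual pivoting optimization, so suitable only for small graphs:

```python
def bron_kerbosch(R, P, X, neighbours, cliques):
    """Collect all maximal cliques; R = current clique, P = candidates, X = excluded."""
    if not P and not X:
        cliques.append(R)
    for v in list(P):
        bron_kerbosch(R | {v}, P & neighbours[v], X & neighbours[v],
                      neighbours, cliques)
        P.remove(v)
        X.add(v)

# adjacency for the first example (self-loops omitted, they play no role here)
neighbours = {1: {2, 3, 4}, 2: {1, 4}, 3: {1}, 4: {1, 2}}
cliques = []
bron_kerbosch(set(), set(neighbours), set(), neighbours, cliques)
print(max(cliques, key=len))  # {1, 2, 4}
```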

Lenormju
  • In my real case implementation so far the number of keys for **my_dict** hasn't exceeded 10. This is therefore perfect for what I needed. Thanks a lot for your pointers and the algorithm! – Sandy B Jun 20 '21 at 11:22
  • @SandyB It took me a while before it clicked in my mind that it could actually be viewed as a graph problem, which often already has solutions published on the Internet. It took me a few hours of work in total, but I really enjoyed it! I'm glad that it perfectly fits your needs :) – Lenormju Jun 20 '21 at 12:52