0

I'm using a function found here to connect lists with at least one common element:

def merge_connected_sublists(list_of_sublists: list):

    def to_edges(sublist):
        """
            treat `sublist` as a Graph and returns it's edges
            to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
        """
        it = iter(sublist)
        last = next(it)

        for current in it:
            yield last, current
            last = current

    def to_graph(list_of_sublists):
        graph = networkx.Graph()
        for sublist in list_of_sublists:
            # each sublist is a bunch of nodes
            graph.add_nodes_from(sublist)
            # it also implies a number of edges:
            graph.add_edges_from(to_edges(sublist))
        return graph

    graph = to_graph(list_of_sublists)
    list_of_merged_sublists = list(connected_components(graph))
    return list_of_merged_sublists

It works pretty good, but I found my dataset contains lists with some "wrong" common elements. Here is a simple example (in my real dataset each element of the lists is an alphanumeric string of 6 to 13 chars):

[
    [1, 2, 3, 4, "a"],
    [2, 4, 5, 6, 7],
    [4, 5, 6, 7, 8, 9],
    ["a", "b", "c", "d"],
    ["b", "c", "d", "f", "g"],
    ["a", "c", "e", "f", "g", "h"]
]

The desired output would be:

[
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    ["a", "b", "c", "d", "e", "f", "g", "h"]
]

In this example, however, the first sublist contains the element a which makes connected_components merge all of the sublists into a single one.

[
    [1, 2, 3, 4, 5, 6, 7, 8, 9, "a", "b", "c", "d", "e", "f", "g", "h"]
]

My dataset contains lists with several common elements, so I was wondering if there's a way to only merge sublists with at least N common elements. With N=2 I'd have such an output:

[
    [1, 2, 3, 4, 5, 6, 7, 8, 9, "a"],
    ["a", "b", "c", "d", "e", "f", "g", "h"],
]

A second step could be the deletion of the elements with no connections within the merged original lists. In this example, a has no connections within the first 3 lists (which will be merged later), so could be removed.

Could you help me performing such a cleanup of my dataset? Thanks!

masavini
  • 119
  • 1
  • 3
  • 10
  • I think from a graph perspective you are looking for "bridges" and 2-edge connectivity (or more general N-edge connectivity). But this will probably not result in your desired output as e.g. `9` is also only contained 1 time. – Sparky05 Jan 27 '22 at 11:20
  • @Sparky05 Thanks for your reply. Any hint or link on how to implements "bridges" and 2-edge connectivity? I'm a total newbie... – masavini Jan 27 '22 at 11:56
  • 1
    `nx.bridges(graph)` yields a list of edges, whose removal results in disconnected components and/or nodes. You probably want to retain bridges whose removal results in single unconnected nodes, so you would have to filter the results accordingly. – Paul Brodersen Jan 27 '22 at 14:49

0 Answers0