I'm using a function found here to connect lists with at least one common element:
def merge_connected_sublists(list_of_sublists: list):
def to_edges(sublist):
"""
treat `sublist` as a Graph and returns it's edges
to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
"""
it = iter(sublist)
last = next(it)
for current in it:
yield last, current
last = current
def to_graph(list_of_sublists):
graph = networkx.Graph()
for sublist in list_of_sublists:
# each sublist is a bunch of nodes
graph.add_nodes_from(sublist)
# it also implies a number of edges:
graph.add_edges_from(to_edges(sublist))
return graph
graph = to_graph(list_of_sublists)
list_of_merged_sublists = list(connected_components(graph))
return list_of_merged_sublists
It works pretty good, but I found my dataset contains lists with some "wrong" common elements. Here is a simple example (in my real dataset each element of the lists is an alphanumeric string of 6 to 13 chars):
[
[1, 2, 3, 4, "a"],
[2, 4, 5, 6, 7],
[4, 5, 6, 7, 8, 9],
["a", "b", "c", "d"],
["b", "c", "d", "f", "g"],
["a", "c", "e", "f", "g", "h"]
]
The desired output would be:
[
[1, 2, 3, 4, 5, 6, 7, 8, 9],
["a", "b", "c", "d", "e", "f", "g", "h"]
]
In this example, however, the first sublist contains the element a
which makes connected_components
merge all of the sublists into a single one.
[
[1, 2, 3, 4, 5, 6, 7, 8, 9, "a", "b", "c", "d", "e", "f", "g", "h"]
]
My dataset contains lists with several common elements, so I was wondering if there's a way to only merge sublists with at least N common elements. With N=2 I'd have such an output:
[
[1, 2, 3, 4, 5, 6, 7, 8, 9, "a"],
["a", "b", "c", "d", "e", "f", "g", "h"],
]
A second step could be the deletion of the elements with no connections within the merged original lists. In this example, a
has no connections within the first 3 lists (which will be merged later), so could be removed.
Could you help me performing such a cleanup of my dataset? Thanks!