1

I have a list with 500+ thousand subsets each having from 1 to 500 values (integers). So i have something like:

{1, 2, 3 }
{2, 3}
{4, 5}
{3, 6, 7}
{7, 9}
{8, 4}
{10, 11}

After running the code I would like to get:

{1, 2, 3, 6, 7, 9}
{4, 5, 8}
{10, 11}

I wrote simple code [here] that compares each subset to each subset, if they intersect they are joined together, else not. It is ok on small scale, but with big amount of data it takes forever.

Please, could you advise any improvements?

P.S. I am not strong in maths or logics, big O notation would be greek for me. I am sorry.

Dmitry
  • 135
  • 1
  • 7

1 Answers1

2

You're trying to find the connected components in a graph, with each of your input sets representing a set of nodes that's fully connected. Here's a simple implementation:

sets = [{1, 2, 3 },{2, 3},{4, 5},{3, 6, 7},{7, 9},{8, 4},{10, 11}]
allelts = set.union(*sets)
components = {X: {X} for X in allelts}
component = {X: X for X in allelts}
for S in sets:
    comp = sorted({component[X] for X in S})
    mergeto = comp[0]
    for mergefrom in comp[1:]:
        components[mergeto] |= components[mergefrom]
        for X in components[mergefrom]:
            component[X] = mergeto
        del components[mergefrom]

That results in components having a list of components (keyed by their minimum element), and component storing the components for each element:

>>> print(components)
{1: {1, 2, 3, 6, 7, 9}, 4: {8, 4, 5}, 10: {10, 11}}
>>> print(component)
{1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 1, 7: 1, 8: 4, 9: 1, 10: 10, 11: 10}
>>>