What is the fastest way to join multiple subsets that have similar elements?

Question

I have a list with 500+ thousand subsets each having from 1 to 500 values (integers). So i have something like:

{1, 2, 3 }
{2, 3}
{4, 5}
{3, 6, 7}
{7, 9}
{8, 4}
{10, 11}

After running the code I would like to get:

{1, 2, 3, 6, 7, 9}
{4, 5, 8}
{10, 11}

I wrote simple code [here] that compares each subset to each subset, if they intersect they are joined together, else not. It is ok on small scale, but with big amount of data it takes forever.

Please, could you advise any improvements?

P.S. I am not strong in maths or logics, big O notation would be greek for me. I am sorry.

Possible duplicate of [Data in a list within a list](http://stackoverflow.com/questions/27802706/data-in-a-list-within-a-list) — Mazdak, Jun 29 '16 at 09:24

score 2 · Accepted Answer · answered Jun 29 '16 at 09:45

You're trying to find the connected components in a graph, with each of your input sets representing a set of nodes that's fully connected. Here's a simple implementation:

sets = [{1, 2, 3 },{2, 3},{4, 5},{3, 6, 7},{7, 9},{8, 4},{10, 11}]
allelts = set.union(*sets)
components = {X: {X} for X in allelts}
component = {X: X for X in allelts}
for S in sets:
    comp = sorted({component[X] for X in S})
    mergeto = comp[0]
    for mergefrom in comp[1:]:
        components[mergeto] |= components[mergefrom]
        for X in components[mergefrom]:
            component[X] = mergeto
        del components[mergefrom]

That results in components having a list of components (keyed by their minimum element), and component storing the components for each element:

>>> print(components)
{1: {1, 2, 3, 6, 7, 9}, 4: {8, 4, 5}, 10: {10, 11}}
>>> print(component)
{1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 1, 7: 1, 8: 4, 9: 1, 10: 10, 11: 10}
>>>

What is the fastest way to join multiple subsets that have similar elements?

1 Answers1