
Possible Duplicate:
Python: simple list merging based on intersections

I have a list of lists:

 lists = [[1, 2, 3], [3, 5, 6], [8, 9, 10], [11, 12, 13]]

Is there a smart and fast way to merge all the sublists that have at least one element in common? For my example, I want the code to return:

 result = [[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]
Brian
  • As the other question didn't have a good answer, I don't think this should be closed yet! – Katriel Feb 19 '12 at 22:54
  • @katrielalex: On what basis are you saying that? "A better data structure could be of help", yes I agree on that, and you're free to add your solution too. – Rik Poggi Feb 19 '12 at 23:11
  • @Rik: looking through the comments, it seemed that all the answers were not-working in one way or another. I only read them briefly though! – Katriel Feb 19 '12 at 23:16
  • @katrielalex: That's because we all encountered the same bug Sven Marnach seems to have. But we also **fixed that**, that's why there are a lot of comments. Basically the problem is that you may have two or more lists "linked" by a non-common element from another list. – Rik Poggi Feb 19 '12 at 23:21
  • @RikPoggi -- so you have. I can't retract the bounty; I'll pick which answer to award it to in a week! – Katriel Feb 19 '12 at 23:27

3 Answers


This works, but maybe isn't very elegant:

def merge_lists(l):
    # Work with sets so intersection tests and merges are cheap.
    s = list(map(set, l))
    i, n = 0, len(s)
    while i < n - 1:
        for j in range(i + 1, n):
            if s[i].intersection(s[j]):
                # Merge overlapping sets, drop the absorbed one,
                # and rescan s[i] against the remaining sets.
                s[i].update(s[j])
                del s[j]
                n -= 1
                break
        else:
            # Nothing after s[i] overlaps it any more; move on.
            i += 1
    return [sorted(x) for x in s]
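
A quick check with the example from the question:

>>> merge_lists([[1, 2, 3], [3, 5, 6], [8, 9, 10], [11, 12, 13]])
[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]
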
hochl
  • Test your code with this data: `[[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [], [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]]` and check it against one of the solutions in the duplicate question. The result should not be the same. This problem was already encountered there when the merging is not trivial. – Rik Poggi Feb 19 '12 at 23:24
  • Ok fixed. Not that it matters since the thread was closed anyways ... – hochl Feb 20 '12 at 00:11

You can do this with essentially a single pass through the data:

list_of_lists = [[1, 2, 3], [3, 5, 6], [8, 9, 10], [11, 12, 13]]
sets = {}  # maps each element to the set it currently belongs to
for lst in list_of_lists:
    s = set(lst)
    t = set()
    for x in s:
        if x in sets:
            t.update(sets[x])  # element seen before: collect its old set
        else:
            sets[x] = s
    for y in t:
        sets[y] = s  # repoint every absorbed element at the current set
    s.update(t)

ids = set()
for s in sets.values():
    if id(s) not in ids:  # deduplicate by object identity
        ids.add(id(s))
        print(s)

This creates a dictionary `sets` mapping each element to the set it belongs to. If some element has been seen before, the set it belonged to is subsumed into the current one, and all dictionary entries pointing to the former set are updated.

Finally, we need to find all unique sets among the values of the dictionary `sets`. Note that while I implemented this as a dictionary of sets, the code also works with lists instead of sets.
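
For a quick check, here is the same approach wrapped in a function (the wrapper and the `merge_sets` name are my packaging, not part of the answer above), run against the test data Rik posted in the comments. It produces exactly two classes: one large merged set, plus `{16}` on its own.

def merge_sets(list_of_lists):
    sets = {}  # element -> the shared, mutable set it currently belongs to
    for lst in list_of_lists:
        s = set(lst)
        t = set()
        for x in s:
            if x in sets:
                t.update(sets[x])  # collect every earlier set that s touches
            else:
                sets[x] = s
        for y in t:
            sets[y] = s  # repoint absorbed elements at the current set
        s.update(t)
    # Deduplicate by object identity: many keys share one set object.
    seen, result = set(), []
    for s in sets.values():
        if id(s) not in seen:
            seen.add(id(s))
            result.append(s)
    return result

data = [[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32],
        [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [],
        [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3],
        [16], [29, 9, 97, 43], [17, 63, 24]]
print(merge_sets(data))  # one big set and {16}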

Sven Marnach
  • I'm not entirely convinced by the `id`s hack -- are you sure it works in all cases? (I haven't quite grokked your algorithm yet.) – Katriel Feb 19 '12 at 22:53
  • You can do the same thing (but slower, probably) with `set(map(frozenset, sets.itervalues()))`. – Katriel Feb 19 '12 at 22:54
  • Test your code with this data: `[[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [], [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]]` and check it against one of the solutions in the duplicate question. The result should not be the same. This problem was already encountered there when the merging is not trivial. – Rik Poggi Feb 19 '12 at 22:54
  • @katrielalex: I don't see any problem with `id()` here, though I also thought about a different solution. Your suggestion would create a set of its own for every element, though, and compares them element-wise when inserting them into the set. Printing the result would be much slower than computing it... – Sven Marnach Feb 19 '12 at 23:01
  • @SvenMarnach: yep, true. It just feels... icky. If you see what I mean. – Katriel Feb 19 '12 at 23:04
  • @RikPoggi: I couldn't find any fundamental problem with my code. Your data contains a list which contains `94` twice -- that was not covered by the answer, nor did it seem to be required by the question. (I made the trivial change to include that case.) – Sven Marnach Feb 19 '12 at 23:56
  • @SvenMarnach: Seeing that it didn't work on that data test I assumed it was for the same reason, *my apologies*. – Rik Poggi Feb 20 '12 at 00:07
  • That's actually many orders of magnitude slower than all the other approaches so far. The reason for that is although you save an additional pass through the data, you introduce a lot of overhead for managing the dictionaries and building up the final result. – Niklas B. Feb 21 '12 at 17:48
  • @NiklasB.: The code didn't receive any review, and there are a few things that can trivially be improved. I didn't bother to do so because there were enough good answers, and this question has been closed anyway. Claiming that my implementation is "many [orders of magnitude](http://en.wikipedia.org/wiki/Order_of_magnitude) slower than all the other approaches" is bold, though. I [timed it against your implementation](https://gist.github.com/1879733) on some random data and found that the difference is less than a factor of two. – Sven Marnach Feb 21 '12 at 23:33
  • @Sven Marnach: Really? I timed it too and the difference was actually more than 2 orders of magnitude (x100 to x1000 and even more on even larger input). Agreed though that this is not relevant any longer. – Niklas B. Feb 21 '12 at 23:42
  • @NiklasB.: Just run the code. The first number it prints is the time in seconds my implementation took, the second number is yours. A few simple micro-optimisations brought the factor down to 1.25 for me. – Sven Marnach Feb 21 '12 at 23:45
  • @Sven Marnach: That other benchmark uses a lot more lists. If you want, I can integrate this solution with optimizations. If not, no problem at all :) – Niklas B. Feb 21 '12 at 23:47
  • @NiklasB.: I don't really understand what you are suggesting. Which benchmark is "that other benchmark"? And are you suggesting to improve my code? If yes, feel free to do so! (BTW, when taking only the first half of the test data, my implementation is quite a bit faster than yours and hochl's.) – Sven Marnach Feb 22 '12 at 00:06
  • @SvenMarnach: Sorry, I was talking about [my benchmark](http://stackoverflow.com/a/9112588/916657) to that duplicate question, always getting confused here. With 5000 lists (5 classes), your function is about 500x slower than Rik's solution over there. Together with your observation that it performs better with less lists, this makes me wonder if it actually is `O(n^2)` rather than `O(n)`. Why would that be? Don't bother to answer if you don't have the time for this. – Niklas B. Feb 22 '12 at 00:18
  • @NiklasB.: The complexities of these algorithms cannot be easily described as `O(n)` etc., because it is quite unclear what `n` is. After thinking about it for a bit, I found that all implementations can show some kind of "quadratic" behaviour -- mine when there are many sets that get joined in only a few classes, the other solutions if there are many sets that don't get joined. – Sven Marnach Feb 22 '12 at 00:25
  • @Sven Marnach: Yeah, I guess that makes sense. – Niklas B. Feb 22 '12 at 00:32
  • Since you have a benchmark, how does my solution fare in comparison? Just out of interest. – hochl Feb 22 '12 at 09:36
  • @hochl: I didn't do extensive tests. Basically your code belongs to the group of implementations that get slow if there are many classes in the final solution. We would need a proper [disjoint-set data structure](http://en.wikipedia.org/wiki/Disjoint-set_data_structure) to implement this efficiently in all cases. Both Niklas' and my benchmark are linked above -- it will be really easy to do your own tests if you want. – Sven Marnach Feb 22 '12 at 12:51
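
For reference, here is a minimal sketch of the disjoint-set (union-find) structure mentioned in the last comment. It is not from the thread; the `find`/`union` helpers and the chaining of adjacent elements are my own illustration:

def merge_with_union_find(list_of_lists):
    parent = {}

    def find(x):
        # Walk up to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for lst in list_of_lists:
        for x in lst:
            parent.setdefault(x, x)
        # Linking adjacent elements is enough to connect a whole sublist.
        for a, b in zip(lst, lst[1:]):
            union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), []).append(x)
    return [sorted(g) for g in groups.values()]

print(merge_with_union_find([[1, 2, 3], [3, 5, 6], [8, 9, 10], [11, 12, 13]]))
# [[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]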

Nice question! It's much simpler if you think of it as a connected-components problem in a graph. The following code uses the excellent `networkx` graph library and the `pairs` function from this question.

def pairs(lst):
    it = iter(lst)
    try:
        first = prev = item = next(it)
    except StopIteration:
        return  # an empty list yields no pairs
    for item in it:
        yield prev, item
        prev = item
    # Close the cycle; for a one-element list this yields a self-pair,
    # so the lone element still ends up as a node in the graph.
    yield item, first

lists = [[1, 2, 3], [3, 5, 6], [8, 9, 10], [11, 12, 13]]

import networkx

g = networkx.Graph()
for sub_list in lists:
    for edge in pairs(sub_list):
        g.add_edge(*edge)

# connected_components() returns an iterable of node sets;
# sort each one to get the lists shown below.
print([sorted(c) for c in networkx.connected_components(g)])
# [[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]

Explanation

We create a new (empty) graph g. For each sub-list in lists, consider its elements as nodes of the graph and add an edge between them. (Since we only care about connectedness, we don't need to add all the edges -- only adjacent ones!) Note that add_edge takes two objects, treats them as nodes (and adds them if they aren't already there), and adds an edge between them.

Then, we just find the connected components of the graph -- a solved problem! -- and output them as our intersecting sets.
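
As a sanity check (mine, not part of the original answer), running the same pipeline on the test data from the comments keeps the singleton `[16]` as its own component, thanks to the self-pair that `pairs` yields for one-element lists:

tricky = [[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32],
          [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [],
          [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3],
          [16], [29, 9, 97, 43], [17, 63, 24]]

g = networkx.Graph()
for sub_list in tricky:
    for edge in pairs(sub_list):
        g.add_edge(*edge)

print([sorted(c) for c in networkx.connected_components(g)])
# two components: one large merged class, and [16] on its own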

Katriel
  • I would suggest moving your answer to the other question. – agf Feb 19 '12 at 22:58
  • I tested your answer with this data: `[[65, 17, 5, 30, 79, 56, 48, 62], [6, 97, 32, 93, 55, 14, 70, 32], [75, 37, 83, 34, 9, 19, 14, 64], [43, 71], [], [89, 49, 1, 30, 28, 3, 63], [35, 21, 68, 94, 57, 94, 9, 3], [16], [29, 9, 97, 43], [17, 63, 24]]` and your result seems to be missing a `16`. – Rik Poggi Feb 19 '12 at 23:06
  • @Rik: nice point -- there was a bug in the `pairs` function I copied from the linked question! =p Fixed. – Katriel Feb 19 '12 at 23:12