
I have a couple of long lists of lists of related objects that I'd like to group to reduce redundancy. Pseudocode:

>>> list_of_lists = [[1,2,3],[3,4],[5,6,7],[1,8,9,10], ...]
>>> remove_redundancy(list_of_lists)
[[1,2,3,4,8,9,10],[5,6,7], ...]

So lists that contain the same elements would be collapsed into single lists. Collapsing them is easy: once I find lists to combine, I can make the lists into sets and take their union. But I'm not sure how to compare the lists in the first place. Do I need to do a series of for loops?
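For reference, the union step I mean:

>>> set([1, 2, 3]) | set([3, 4])
{1, 2, 3, 4}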

My first thought was that I should loop through and check whether each item in a sublist is in any of the other lists; if so, merge the lists and start over. But that seems terribly inefficient. I did some searching and found this: Python - dividing a list-of-lists to groups, but my data isn't structured that way. Also, my actual data is a series of strings, and thus not sortable in any meaningful sense.

I can write some gnarly looping code to make this work, but I was wondering if there are any built-in functions that would make this sort of comparison easier. Maybe something in list comprehensions?

kevbonham
  • Just to be clear: if any two lists contain any matching elements, you want to merge their elements? Do you have duplicate elements, e.g. `[1, 2, 2, 3]` and `[1, 5]`? Generally you should be looking at using tuples and sets rather than lists of lists for this kind of thing – YXD Jun 18 '15 at 13:53
  • I have the same question as YXD - what sort of redundancy are you trying to remove? – user1941126 Jun 18 '15 at 14:01
  • @YXD - To your first question, yes. If my example also had a `[1, 5]` list in it, then all the elements should end up in one list. In the end, I basically want a series of sets (no duplicates anywhere), but my initial data might have duplicates within a list. Perhaps I should explain what my data actually looks like/where it came from... will post an edit in a sec. – kevbonham Jun 18 '15 at 14:28

2 Answers


I think this is a reasonably efficient way of doing it, if I understand your question correctly. The result here will be a list of sets.

Maybe the missing bit of knowledge was `d & g` (also written `d.intersection(g)`) for finding the set intersection, along with the fact that an empty set is "falsey" in Python.
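For instance (my own quick illustration of those two facts):

d = {1, 2, 3}
g = {3, 4}
print(d & g)        # {3} -- the shared element
print(bool(set()))  # False -- an empty set is falsey
print(bool(d & g))  # True -- a non-empty intersection is truthy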

data = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]]

result = []

for d in data:
    d = set(d)

    matched = [d]
    unmatched = []
    # first divide into matching and non-matching groups
    for g in result:
        if d & g:
            matched.append(g)
        else:
            unmatched.append(g)
    # then combine all matching groups into one group
    # while leaving unmatched groups intact
    result = unmatched + [set().union(*matched)]

print(result)
# [set([5, 6, 7]), set([1, 2, 3, 4, 8, 9, 10])]

We start with no groups at all (`result = []`). Then we take the first list from the data. We then check which of the existing groups intersect with this list and which don't. Then we merge all of the matching groups together with the list itself (achieved by starting with `matched = [d]`). We don't touch the non-matching groups, though some of them may end up being merged in a later iteration. If you add a line `print(result)` at the end of each loop iteration, you should be able to see how the result is built up.
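With the data above, that trace looks like this (worked out by hand, not program output):

# after [1,2,3]:     [{1, 2, 3}]
# after [3,4]:       [{1, 2, 3, 4}]
# after [5,6,7]:     [{1, 2, 3, 4}, {5, 6, 7}]
# after [1,8,9,10]:  [{5, 6, 7}, {1, 2, 3, 4, 8, 9, 10}]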

The union of all the sets in `matched` is computed by `set().union(*matched)`.
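For example (my own illustration):

print(set().union(*[{1, 2}, {2, 3}, {4}]))
# {1, 2, 3, 4}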

YXD
  • This does what I want. I added `[11], [11, 12], [13]` to the end of `data` and it generates two additional sets, but if I also add `[12, 5]`, it puts `11` and `12` into the set with `[5, 6, 7]`. Exactly what I wanted... Not sure I understand all of the steps, but I think I can puzzle it out now. Thanks! – kevbonham Jun 18 '15 at 14:40
  • Great! I added some more explanatory text. – YXD Jun 18 '15 at 14:50
  • Explanation helped a lot, but I was still struggling a bit with what `set().union(*matched)` was doing. Then I printed `matched` and `unmatched` for each loop as well, and all became clear! Thanks again for the help! – kevbonham Jun 18 '15 at 15:01

I assume that you want to merge lists that contain any common element.

Here is a function that checks, reasonably efficiently (to the best of my knowledge), whether two lists contain at least one common element (according to the `==` operator):

import functools  # functools.reduce (Python 2.6+ / 3.x)

def seematch(X, Y):
    # OR together every pairwise equality test between X and Y
    return functools.reduce(lambda x, y: x | y,
                            functools.reduce(lambda x, y: x + y,
                                             [[k == l for k in X] for l in Y]))
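A quick sanity check (my own example, not from the original answer):

print(seematch([1, 2, 3], [3, 4]))  # True  -- the lists share 3
print(seematch([1, 2], [5, 6]))     # False -- no common element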

It would be even faster if you used a reduce that can be interrupted on finding a True, as described here: Stopping a Reduce() operation mid way. Functional way of doing partial running sum.
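(A side note beyond the original answer: Python's built-in any() already short-circuits on the first True, so the same check can be written without reduce and with early exit. seematch_any is my own name for this variant.)

def seematch_any(X, Y):
    # any() stops at the first matching pair, unlike the reduce version above
    return any(k == l for l in Y for k in X)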

I was trying to find an elegant way to iterate quickly once that is in place, but I think a good way is simply to loop once, creating another container that will hold the "merged" lists: you loop once over the lists contained in the original list, and for each of those over every group created so far in the proxy container.
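A minimal sketch of that loop as I read it (my own interpretation, not code from the answer; merged is my name for the proxy container, and data is the list of lists from the question):

merged = []  # the "proxy" container of merged groups
for lst in data:
    group = set(lst)
    untouched = []
    for g in merged:
        if seematch(list(group), list(g)):  # shares at least one element?
            group |= g                      # fold the matching group in
        else:
            untouched.append(g)
    merged = untouched + [group]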

Having said that, it seems there might be a much better option: see if you can do away with that redundancy through some sort of book-keeping in the previous steps that produce the data.

I know this is an incomplete answer, but I hope it helps anyway!

user1941126
  • Searching for the answers to a bunch of my issues lately has yielded results with lambda functions and `reduce` (also `map`) that I don't understand at all. Clearly I need to go back and do some rudimentary reading on this subject - the programming knowledge I've cobbled together is clearly incomplete. – kevbonham Jun 18 '15 at 15:07