53
[[1, '34', '44'], [1, '40', '30', '41'], [1, '41', '40', '42'], [1, '42', '41', '43'], [1, '43', '42', '44'], [1, '44', '34', '43']]

I have a list of lists. My aim is to check whether any one sublist has anything in common with other sublists(excluding the first index object to compare). If it has anything in common then unify those sublists.

For example, for this example my final answer should be something like:

[[1, '34', '44', '40' '30', '41', '42', '43']]

I can understand that I should convert the sublists to sets and then use union() and intersection() operations. But what I am stuck with is how to compare each set/sublist. I can't run a loop over the list and compare each sublist one by one as the contents of the list would be modified and this would lead to an error.

What I want to know is there any efficient method to compare all the sublists(converted to sets) and get a union of them?

MSH
  • 1,743
  • 2
  • 14
  • 22
Tapojyoti Mandal
  • 742
  • 1
  • 5
  • 14
  • 1
    do you need the same order? – Ajay Jun 11 '15 at 07:10
  • No, there is no need to preserve order. – Tapojyoti Mandal Jun 11 '15 at 07:12
  • Can there be other lists starting with e.g. `2`, which shouldn't be combined? – Peter Wood Jun 11 '15 at 07:18
  • 2
    Do you want this?http://stackoverflow.com/a/27803361/2867928 – Mazdak Jun 11 '15 at 07:48
  • Actually, I forgot to stress on one important condition. Sorry, for my mistake. I have also mentioned that the sublists should be unified if and only if they have something in common otherwise they should remain as it is. So, first there is a need to check for intersection() which if isn't empty then only union should be done. @Peter Wood No, there will not be any sublist with separate starting index element like '2' or '3'. I mean in the list of lists all the sublist will have same first index element. – Tapojyoti Mandal Jun 11 '15 at 07:49
  • @Kasra Yes, Thanks this is exactly the same problem that I am facing. – Tapojyoti Mandal Jun 11 '15 at 08:02
  • @TapojyotiMandal Welcome! ;) – Mazdak Jun 11 '15 at 08:03

7 Answers7

83

The itertools module makes short work of this problem:

>>> from itertools import chain
>>> list(set(chain.from_iterable(d)))
[1, '41', '42', '43', '40', '34', '30', '44']

Another way to do it is to unpack the list into separate arguments for union():

>>> list(set().union(*d))
[1, '41', '42', '43', '40', '34', '30', '44']

The latter way eliminates all duplicates and doesn't require that the inputs first be converted to sets. Also, it doesn't require an import.

Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
  • 1
    How does itertools generally scale? In your experience, can this kind of an operation handle tens or hundreds of millions item long lists ('items' being strings here)? Or even larger? – Hassan Baig Sep 24 '17 at 17:57
  • 3
    The ``chain.from_iterable()`` step is scale invariant. At any given time, its whole state is stored in just two pointers to iterators. The ``set()`` and ``list()`` parts eat memory in proportion to the total number of unique inputs. On my 64-bit machine, one hundred million unique inputs takes 4.3 GB of RAM for the set object and 0.9 GB for the list object. – Raymond Hettinger Sep 25 '17 at 07:22
  • 3
    You should better write `set.union()` as `set()` initializes with empty set. This is ok in this case but I spent a lot of time searching for bugs because I assumed this generalizes to intersection. With `set.` you can do both union and intersection! – Radio Controlled Jul 07 '20 at 11:49
  • 1
    @RadioControlled: `set().union(*d)` works with an empty `d`, though, which is a more important factor than symmetry with what you would do for an intersection. – user2357112 Jan 04 '21 at 12:26
  • Too bad all the answers are answering a different question from what was asked, though, probably due to the poor choice of example in the question. The question seems to have actually been asking for something closer to a hypergraph connected components algorithm, not just dumping everything into a single set. (dermen's answer is slightly different, but it ends up being even wronger.) – user2357112 Jan 04 '21 at 12:32
44

Using the unpacking operator *:

>> list(set().union(*a))
[1, '44', '30', '42', '43', '40', '41', '34']

(Thanks Raymond Hettinger and ShadowRanger for the comments!)

(Note that

set.union(*tup)

will unpack to

set.union(tup[0], tup[1], ... tup[n - 1])

)

Skippy le Grand Gourou
  • 6,976
  • 4
  • 60
  • 76
Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • This method works fine. Thanks. Could you explain what is the use of '*' in the code? Or maybe provide a link where I could study related to this to understand more. – Tapojyoti Mandal Jun 11 '15 at 07:38
  • 3
    FWIW, the *tuple* step has no effect because star-unpacking works on any iterable. You could also replace the list comprehension with ``map(set, a)``. The result boils down to ``list(set.union(*map(set, a)))``. – Raymond Hettinger Jun 11 '15 at 07:50
  • 1
    @TapojyotiMandal See explanation in answer. – Ami Tavory Jun 11 '15 at 07:51
  • 5
    You could dramatically reduce the number of temporary `set`s by changing `set.union(*map(set, a))` to `set().union(*a)`. The only reason the `map(set,` was needed was because you were calling `set.union` and the first argument became the "self" it was being called on, but if you make an empty `set` as the base, `union` accepts arbitrary iterables for the remaining arguments. – ShadowRanger May 09 '19 at 04:32
1
In [20]: s
Out[20]: 
[[1, '34', '44'],
 [1, '40', '30', '41'],
 [1, '41', '40', '42'],
 [1, '42', '41', '43'],
 [1, '43', '42', '44'],
 [1, '44', '34', '43']]
In [31]: list({x for _list in s for x in _list})
Out[31]: [1, '44', '30', '42', '43', '40', '41', '34']

Update:

Thanks for the comments

Ajay
  • 5,267
  • 2
  • 23
  • 30
1

You can use itertools to perform this action. Let us assume that your list has a variable name A

import itertools

single_list_with_all_values = list(itertools.chain(*A))
single_list_with_all_values.sort()

print set(single_list_with_all_values)
Arpit Goyal
  • 2,212
  • 11
  • 31
  • 4
    This is pretty good. A few improvements though. 1) A ``chain(*it)`` should always be changed to ``chain.from_iterable(it)``. 2) There is no need to ``sort()`` because the ordering is lost in making the *set*. 3) Without the sort, there is no need to convert to a *list* before making the *set*. With those changes, it boils down to ``set(chain.from_iterable(d))``. – Raymond Hettinger Jun 11 '15 at 07:45
1
>>> big = [[1, '34', '44'], [1, '40', '30', '41'], [1, '41', '40', '42'], [1, '42', '41', '43'], [1, '43', '42', '44'], [1, '44', '34', '43']]
>>> set(reduce ( lambda l,a : l + a, big))
set([1, '44', '30', '42', '43', '40', '41', '34'])

And if you really want a list of a list as a final result

>>>>[list(set(reduce ( lambda l,a : l + a, big)))]
[[1, '44', '30', '42', '43', '40', '41', '34']]

And if you don't like recoding a lambda function for the list addition :

>>>>[list(set(reduce ( list.__add__, big)))]
[[1, '44', '30', '42', '43', '40', '41', '34']]

EDIT : after your recommendation about using itertools.chain instead of list.__add__ I ran a timeit for both with the original variable used by the original poster.

It seems that timeit times list.__add__ around 2.8s and itertools.chain around 3.5 seconds.

I checked on this page and yes, you were right with the itertools.chain contains a from_iterable method that grants a huge performance boost. see below with list.__add__, itertools.chain and itertools.chain.from_iterable.

>>> timeit.timeit("[list(set(reduce ( list.__add__, big)))]", setup="big = [ [10,20,30,40] for ele in range(10000)]", number=30)
16.051744650801993
>>> timeit.timeit("[list(set(reduce ( itertools.chain, big)))]", setup="big = [ [10,20,30,40] for ele in range(10000)]", number=30)
54.721315866467194
>>> timeit.timeit("list(set(itertools.chain.from_iterable(big)))", setup="big = [ [10,20,30,40] for ele in range(10000)]", number=30)
0.040056066849501804

Thank you very much for your advises :)

Azurtree
  • 369
  • 4
  • 11
  • 1
    Adding lists together like this is an inefficient O(n**2) operation and is almost always a bad idea. Please use ``itertools.chain`` instead. – Raymond Hettinger Jun 11 '15 at 07:51
1
from functools import reduce

out = list(reduce(set.union, iterable))

as long as at least the first the element of iterable is a set. Otherwise,

out = list(reduce(set.union, iterable[1:], set(iterable[0])))
PeterFoster
  • 319
  • 1
  • 3
  • 9
0

Tested with python 2 only: I personally like the readability of reduce, paired with a simple conditional function, something like

# PYTHON 2 ONLY!
somelists = [[1, '41', '40', '42'], [1, '42', '41', '43'], [1, '43', '42', '44'], [1, '44', '34', '43']] # your original lists
somesets = map(set,somelists) #your lists as sets

def condition(s1,s2): # condition to apply recursively to the sets
    if s1.intersection(s2):
        return s1.union(s2)
reduce( condition,somesets)
#{1, '30', '34', '40', '41', '42', '43', '44'}

Of course you can cast this result to a 2d list if you desire list([reduce(...

I will note that this is something like 3x slower than the chain.fromiterable answer.

dermen
  • 5,252
  • 4
  • 23
  • 34