Easy & efficient way to get intersection of the 2nd elements for several lists of tuples in Python?

Question

I'm new to Python (working in 2.7), and I am finding SO to be a very valuable resource!

Let's say I'm working with several lists of 2-element tuples generally of the form (ID, value), e.g.,

list1 = [(111, 222), (111, 333), (111, 444)]
list2 = [(555, 333), (555, 444), (555, 777)]
list3 = [(123, 444), (123, 888), (123, 999)]

What I really want to do is find an easy (and computationally efficient) way to get the intersection of the 2nd elements of these tuples. I've looked in the Python docs and found that sets might do what I want... and this post has been helpful in helping me understand how to get the intersection of two lists.

I understand that I could make three whole new "values-only" lists by looping through the tuples like this:

newList1 = []
for tuple in list1:
   newList1.append(tuple[1])
newList2 = []
for tuple in list2:
   newList2.append(tuple[1])
newList3 = []
for tuple in list3:
   newList3.append(tuple[1])

and then get the intersection of each pair like this:

i_of_1and2 = set(newList1).intersection(newList2)
i_of_1and3 = set(newList2).intersection(newList3)
i_of_2and3 = set(newList1).intersection(newList3)

But my lists are a bit large - like hundreds of thousands (sometimes tens of millions) of tuples. Is this really the best way to go about getting the intersection of the 2nd elements in these three lists tuples? It seems...inelegant...to me.

Thanks for any help!

It's generally a good idea to give example output - what you expect to get. — Gareth Latty, May 14 '12 at 01:29
Are those lists ordered by second element of each tuple? Are you allowed to reorder elements on those lists? If so, you could try a merge-sort-like algorithm and avoid copying data. — liori, May 14 '12 at 01:35
good point. I expect the output to be a list or set containing the intersection of the 2nd elements in the original list1 and list2...i.e., the common values in the the (ID, value) tuples. So, for i_of_1and2, I expect [333,444]. — CJH, May 14 '12 at 01:36
naah, they aren't sorted by 2nd element. The 1st element is a person's ID number, and the 2nd element is a product code. I want to know common products. — CJH, May 14 '12 at 01:38

score 3 · Accepted Answer · edited May 23 '17 at 12:25

You are showing a large problem to begin with variable1 is generally a bad sign - if you want to have multiple values, use a data structure, not lots of variables with numbered names. This stops you repeating your code over and over, and helps stop bugs.

Let's use a list of lists instead:

values = [
    [(111, 222), (111, 333), (111, 444)],
    [(555, 333), (555, 444), (555, 777)],
    [(123, 444), (123, 888), (123, 999)]
]

Now we want to get only the second element of each tuple in the sublists. This is easy enough to compute using a list comprehension:

>>> [[item[1] for item in sublist] for sublist in values]
[[222, 333, 444], [333, 444, 777], [444, 888, 999]]

And then, we want the intersections between the items, we use itertools.combinations() to get the various pairs of two possible:

>>> for values, more_values in itertools.combinations(new_values, 2):
...     set(values).intersection(more_values)
... 
{444, 333}
{444}
{444}

So, if we wrap this together:

import itertools

values = [
    [(111, 222), (111, 333), (111, 444)],
    [(555, 333), (555, 444), (555, 777)],
    [(123, 444), (123, 888), (123, 999)]
]

sets_of_first_items = ({item[1] for item in sublist} for sublist in values)
for values, more_values in itertools.combinations(sets_of_first_items, 2):
    print(values.intersection(more_values))

Which gives us:

{444, 333}
{444}
{444}

The change I made here was to make the inner list a set comprehension, to avoid creating a list just to turn it into a set, and using a generator expression rather than a list comprehension, as it's lazily evaluated.

As a final note, if you wanted the indices of the lists we are using to generate the intersection, it's simple to do with the enumerate() builtin:

sets_of_first_items = ({item[1] for item in sublist} for sublist in values)
for (first_number, first_values), (second_number, second_values) in itertools.combinations(enumerate(sets_of_first_items), 2):
    print("Intersection of {0} and {1}: {2}".format(first_number, second_number, first_values.intersection(second_values)))

Which gives us:

Intersection of 0 and 1: {444, 333}
Intersection of 0 and 2: {444}
Intersection of 1 and 2: {444}

Edit:

As noted by tonyl7126, this is also an issue that could be greatly helped by using a better data structure. The best option here is to use a dict of user id to a set of product ids. There is no reason to store your data as a list when you only need a set, and are going to convert it to a set later, and the dict is a much better solution for the type of data you are trying to store.

See the following example:

import itertools

values = {
    "111": {222, 333, 444},
    "555": {333, 444, 777},
    "123": {444, 888, 999}
}

for (first_user, first_values), (second_user, second_values) in itertools.combinations(values.items(), 2):
    print("Intersection of {0} and {1}: {2}".format(first_user, second_user, first_values.intersection(second_values)))

Giving us:

Intersection of 555 and 123: {444}
Intersection of 555 and 111: {444, 333}
Intersection of 123 and 111: {444}

+1 Same way I did it :D The only thing is that the OP has specified Python 2.7 so your final example won't be cross compatible. — jamylak, May 14 '12 at 01:39
Awesome! I'll give this a shot, thanks. And also thanks for the learning moment re: vars with numbered names. — CJH, May 14 '12 at 01:42
@CJH In my final example, just remove the brackets from the print line - in 2.x, print is a statement, not a function. Edit: Updated to a version that works in 2.x and 3.x. — Gareth Latty, May 14 '12 at 01:48
@Lattyware sorry my bad, it would work in Python 2.7 but not earlier. — jamylak, May 14 '12 at 01:56
@Lattyware: got it, thanks. It worked like a champ in Python 2.7 — CJH, May 14 '12 at 01:56

score 2 · Answer 2 · answered May 14 '12 at 01:45

I'm not sure if you've read about dictionaries in python yet, but that seems like it might fit what you are trying to do better in combination with lists. Dictionaries are made up of keys and values, just like what you seem to be emulating with your 2 element tuples.

So for example, list1, list2, and list3 could be represented as a dictionary that would look like this (assuming 111 is the id): your_dict = {"111": [222, 333, 444], "555": [333, 444, 777], "123":[444, 888, 999]}

So, if you wanted to get all of the values for a specific id, like "111", you would write: your_dict.get("111") and that would return the list. Here is a link to some documentation on dictionaries as well. http://docs.python.org/library/stdtypes.html#typesmapping

I'd note that it's a really odd thing to say that you'd use `your_dict.get("111")`. 99.9% of the time in Python you'd be using `your_dict["111"]`. — Gareth Latty, May 14 '12 at 01:59

score 1 · Answer 3 · answered May 14 '12 at 01:36

You could take advantage of the fact that the set.intersection(...) method takes 2 or more sets and finds their intersection. Also, you can use list comprehensions to reduce the code bloat. And lastly, you can use argument list unpacking to make it a one-liner. For example:

>>> list1 = [(111, 222), (111, 333), (111, 444)]
>>> list2 = [(555, 333), (555, 444), (555, 777)]
>>> list3 = [(123, 444), (123, 888), (123, 999)]
>>>
>>> set.intersection(*[set(t[1] for t in l) for l in (list1, list2, list3)])
set([444])

To help you understand what's going on, the call to set.intersection(...) is equivalent to this python code:

>>> allsets = []
>>> for l in (list1, list2, list3):
...   n = set()
...   for t in l:
...     n.add(t[1])
...   allsets.append(n)
... 
>>> allsets
[set([444, 333, 222]), set([777, 444, 333]), set([888, 444, 999])]
>>> allsets[0].intersection(allsets[1]).intersection(allsets[2])
set([444])

ah, this looks like something I'll use, too. Thanks! – CJH May 14 '12 at 01:46 — CJH, May 14 '12 at 01:46

score 1 · Answer 4 · answered May 14 '12 at 04:38

Here is a easy way of doing it.

>>> list1 = [(111, 222), (111, 333), (111, 444)]
>>> list2 = [(555, 333), (555, 444), (555, 777)]
>>> list3 = [(123, 444), (123, 888), (123, 999)]
>>> lists = [list1, list2, list3]
>>> set.intersection(*(set(zip(*list)[1]) for list in lists))
set([444])

The zip * trick is used to unzip the tuples and get the sets of 2nd elements.
set.intersection * is used to intersect them all together.

With regards to efficiency, I would try the easy way first and see if that is fast enough before trying to optimize.

Easy & efficient way to get intersection of the 2nd elements for several lists of tuples in Python?

4 Answers4