Find duplicate items within a list of list of tuples Python

Question

I want to find the matching item from the below given list.My List may be super large.

The very first item in the tuple "N1_10" is duplicated and matched with another item in another array

tuple in 1st array in the ListA ('N1_10', 'N2_28')
tuple in 2nd array in the ListA ('N1_10', 'N3_98')

ListA  = [[('N1_10', 'N2_28'), ('N1_35', 'N2_44')],
          [('N1_22', 'N3_72'), ('N1_10', 'N3_98')],
          [('N2_33', 'N3_28'), ('N2_55', 'N3_62'), ('N2_61', 'N3_37')]]

what I want for the output is

output --> [('N1_10','N2_28','N3_98') , .... and the rest whatever match one of the key will get into same tuple]

If you guys think , changing the data structure of the ListA is better option , pls feel free to advise! Thanks for helping out!

SIMPLIFIED VERSION

List A = [[(a,x),(b,k),(c,l),(d,m)],[(e,d),(a,p),(g,s)],[...],[...]....]

wantedOutput --> [(a,x,p),(b,k),(c,l),(d,m,e),(g,s).....]

I meant, what factors the output? I'm not seeing a pattern, if you get what I mean — TerryA, Jul 05 '13 at 07:06
Oh wait, I see. Is it only the first item in the tuple which is a duplicate of something else? — TerryA, Jul 05 '13 at 07:07
i have edited the question. The item may be any in any position but each tuple got only 2 items.. Thanks for helping out @Haidro :) — Peter, Jul 05 '13 at 07:13
I'm afraid I may not know the answer to this, but I'll see... — TerryA, Jul 05 '13 at 07:14
it's ok :) i know there are genius out there! so should be fine ^^ — Peter, Jul 05 '13 at 07:18
Can each string not occur more than twice? In your simplified version, what would happen if a is also in sub-list 3? — Jared, Jul 05 '13 at 07:39
related: [Replace list of list with “condensed” list of list while maintaining order](http://stackoverflow.com/q/13714755/4279) — jfs, Jul 05 '13 at 08:10

user2357112 · Accepted Answer · 2013-07-05T19:10:22.783

Update: After rereading your question, it appears that you're trying to create equivalence classes, rather than collecting values for keys. If

[[(1, 2), (3, 4), (2, 3)]]

should become

[(1, 2, 3, 4)]

, then you're going to need to interpret your input as a graph and apply a connected components algorithm. You could turn your data structure into an adjacency list representation and traverse it with a breadth-first or depth-first search, or iterate over your list and build disjoint sets. In either case, your code is going to suddenly involve a lot of graph-related complexity, and it'll be hard to provide any output ordering guarantees based on the order of the input. Here's an algorithm based on a breadth-first search:

import collections

# build an adjacency list representation of your input
graph = collections.defaultdict(set)
for l in ListA:
    for first, second in l:
        graph[first].add(second)
        graph[second].add(first)

# breadth-first search the graph to produce the output
output = []
marked = set() # a set of all nodes whose connected component is known
for node in graph:
    if node not in marked:
        # this node is not in any previously seen connected component
        # run a breadth-first search to determine its connected component
        frontier = set([node])
        connected_component = []
        while frontier:
            marked |= frontier
            connected_component.extend(frontier)

            # find all unmarked nodes directly connected to frontier nodes
            # they will form the new frontier
            new_frontier = set()
            for node in frontier:
                new_frontier |= graph[node] - marked
            frontier = new_frontier
        output.append(tuple(connected_component))

Don't just copy this without understanding it, though; understand what it's doing, or write your own implementation. You'll probably need to be able to maintain this. (I would've used pseudocode, but Python is practically as simple as pseudocode already.)

In case my original interpretation of your question was correct, and your input is a collection of key-value pairs that you want to aggregate, here's my original answer:

Original answer

import collections

clusterer = collections.defaultdict(list)

for l in ListA:
    for k, v in l:
        clusterer[k].append(v)

output = clusterer.values()

defaultdict(list) is a dict that automatically creates a list as the value for any key that wasn't already present. The loop goes over all the tuples, collecting all values that match up to the same key, then creates a list of (key, value_list) pairs from the defaultdict.

(The output of this code is not quite in the form you specified, but I believe this form is more useful. If you want to change the form, that should be a simple matter.)

First, the output from this code is a list of a list of characters. You should use ```append``` not ```+=```. Second, (if you used append), the code loses relation information. I think I know what you were trying to do, but it's not right. — korylprince, Jul 05 '13 at 07:59
@korylprince: Crap, you're right about `append`. My bug. For the other thing, I originally read the question as trying to aggregate a bunch of key-value pairs, but upon a second reading, it appears to be a connected-components analysis of a graph. I'll update my response with a connected-components algorithm. — user2357112, Jul 05 '13 at 08:04
@user2357112 , yah , it is kind of adjacency matrix list representation. if you have time could you pls tell how how to change it ? — Peter, Jul 05 '13 at 08:46
@Peter: I've added an example implementation of a breadth-first search-based algorithm. — user2357112, Jul 05 '13 at 19:11

score 2 · Answer 2 · edited Mar 14 '16 at 05:51

2

tupleList = [(1, 2), (3, 4), (1, 4), (3, 2), (1, 2), (7, 9), (9, 8), (5, 6)]

newSetSet = set ([frozenset (aTuple) for aTuple in tupleList])
setSet = set ()

while newSetSet != setSet:
    print '*'
    setSet = newSetSet
    newSetSet = set ()
    for set0 in setSet:
        merged = False
        for set1 in setSet:
            if set0 & set1 and set0 != set1:
                newSetSet.add (set0 | set1)
                merged = True
        if not merged:
            newSetSet.add (set0)

        print [tuple (element) for element in setSet]
        print [tuple (element) for element in newSetSet]
        print 

print [tuple (element) for element in newSetSet]

# Result:  [(1, 2, 3, 4), (5, 6), (8, 9, 7)]

edited Mar 14 '16 at 05:51

Shyam Bhimani

1,310
1
22
37

answered Jul 05 '13 at 07:56

Jacques de Hooge

6,750
2
28
45

Tested, take out the print statements for speed. – Jacques de Hooge Jul 05 '13 at 09:26
Technically, it works (I think), but that's going to have atrocious performance. In the worst case, it'll run in exponential time, because it may end up going through every possible combination of 2 or more input elements. – user2357112 Jul 05 '13 at 18:34
Each time the while loop is done, at least one merge has happened, which makes the cardinality of newSetSet one less. So I would expect O (n^3) rather than O (exp (n) for the three nested loops. But I may very well have overlooked something. If so, please explain. – Jacques de Hooge Jul 05 '13 at 19:13
A set can merge with many other sets each iteration, not just one. These merges don't produce one big merged set; they produce a bunch of partially merged sets. newSetSet can actually grow when this happens. Try it with `tupleList = [(x, y) for x in range(n) for y in range(x)]` for various values of `n`. – user2357112 Jul 05 '13 at 19:21

korylprince · Answer 3 · 2013-07-06T02:37:08.697

2

Does output order matter? This is the simplest way I could think of:

ListA  = [[('N1_10', 'N2_28'), ('N1_35', 'N2_44')],[('N1_22', 'N3_72'), ('N1_10', 'N3_98')],
            [('N2_33', 'N3_28'), ('N2_55', 'N3_62'), ('N2_61', 'N3_37')]]

idx = dict()

for sublist in ListA:
    for pair in sublist:
        for item in pair:
            mapping = idx.get(item,set())
            mapping.update(pair)
            idx[item] = mapping 
            for subitem in mapping:
                submapping = idx.get(subitem,set())
                submapping.update(mapping)
                idx[subitem] = submapping


for x in set([frozenset(x) for x in idx.values()]):
    print list(x)

Output:

['N3_72', 'N1_22']
['N2_28', 'N3_98', 'N1_10']
['N2_61', 'N3_37']
['N2_33', 'N3_28']
['N2_55', 'N3_62']
['N2_44', 'N1_35']

edited Jul 06 '13 at 02:37

answered Jul 05 '13 at 08:07

korylprince

2,969
1
18
27

I'm pretty sure that's bugged. Whenever an input pair is processed, the set corresponding to the second item of the pair gets replaced with the set corresponding to the first item. This loses data if the second item already had stored associations. – user2357112 Jul 05 '13 at 08:18
Actually i am pretty new to python. can u explain more about forzenset? i juz now confirm with my data set , there is no duplicated data in each array of tuples however , the items may duplicated for different array across the big listA. – Peter Jul 05 '13 at 09:11
@user2357112 good catch. Updated the code to fix this and readability. @Peter - a ```frozenset``` is just a ```set``` that can't be modified. I use that line of code because you can't have a ```set``` of ```set```s in python (because a ```set``` isn't hashable) but you can have a ```set``` of ```frozenset```s. – korylprince Jul 06 '13 at 02:38

Find duplicate items within a list of list of tuples Python

3 Answers3

Linked