How can I optimize set matching in this case using python?

Question

I have a text file with 2 columns of "scaffolds", like this:

scaffold1|size14662     scaffold1|size14662    
scaffold1|size14662     scaffold2|size14565    
scaffold1|size14662     scaffold111160|size1478
scaffold2|size14565     scaffold2|size14565    
scaffold2|size14565     scaffold1|size14662    
scaffold2|size14565     scaffold239623|size320 
scaffold3|size14436     scaffold3|size14436    
scaffold3|size14436     scaffold5|size13770    
scaffold3|size14436     scaffold5|size13770    
scaffold3|size14436     scaffold149|size9055   
scaffold4|size14291     scaffold4|size14291    
scaffold4|size14291     scaffold32275|size3028 
scaffold4|size14291     scaffold66288|size2175 
scaffold5|size13770     scaffold5|size13770    
scaffold5|size13770     scaffold133|size9198   
scaffold5|size13770     scaffold149|size9055   
scaffold6|size13181     scaffold6|size13181    
scaffold6|size13181     scaffold92|size9644    
scaffold6|size13181     scaffold113496|size1447
scaffold7|size13167     scaffold7|size13167

The "scaffolds" on the right column are a "match" (as in "are the same thing") to the respective "scaffolds" on the left column, eg.:

[scaffold1|size14662, scaffold2|size14565, scaffold111160|size1478]

from the right column, are the same as scaffold1|size14662 from the left column.

What I need from this file it to get a list (not a python list, just a list) with sets of all the matching scaffolds, like this:

scaffold1|size14662
scaffold2|size14565
scaffold111160|size1478
scaffold239623|size320
---
scaffold7|size13167
---
scaffold5|size13770
scaffold3|size14436
scaffold149|size9055
scaffold133|size9198
---
scaffold92|size9644
scaffold113496|size1447
scaffold6|size13181
---
scaffold32275|size3028
scaffold66288|size2175
scaffold4|size14291

I was able to produce some code that does this, but it is extremely slow as it iterates through the same list over and over again. Since I am working with a file that has about 2M lines, this is not a good solution.

rawscafs = open ("columnfile")

scafs={}
for line in rawscafs:
    cont = 0
    splitvalues=line.split()
    for k,v in scafs.items():
        if splitvalues[1] in v:
            cont = 1
        elif splitvalues[0] in v:
            scafs[k].add(splitvalues[1])
            cont = 1
    if cont == 1:
        cont = 0
        continue       
    if splitvalues[0] in scafs:
        scafs[splitvalues[0]].add(splitvalues[1])
    else:
        scafs[splitvalues[0]] = set()
        scafs[splitvalues[0]].add(splitvalues[1])
rawscafs.close()


for key in scafs:
    for i in (scafs[key]):
        print(i+"\n")
    print("---\n")

rawscafs.close()

As you can see, this is ugly code, but I was just looking for a quick&dirty solution. Which I obviously did not find yet. Can anyone help me optimize this code (or provide a simpler solution, as I am sure there must be one, I just can't figure it out).

If I understand you correctly, this is known as [set consolidation](http://rosettacode.org/wiki/Set_consolidation), and can be viewed as a [connected-components problem](http://en.wikipedia.org/wiki/Connected_component_%28graph_theory%29). There are some implementations [here](http://rosettacode.org/wiki/Set_consolidation#Python) and you can see [this question](http://stackoverflow.com/questions/9110837/python-simple-list-merging-based-on-intersections) for timing tests on a number of implementations. I use the iterative version in the first link myself. — DSM, Dec 11 '13 at 15:58

score 0 · Accepted Answer · answered Dec 13 '13 at 16:16

Thanks to @DSM for providing the pointer! Using the information provided there, I was able to reach a solution for my problem. Here it is:

#!/usr/bin/python3

infile = open('columnfile','r')

title = ""
scaf = set()
scafs = []
for lines in infile:
    lines = lines.split()
    if lines[0] != title:
        title = lines[0]
        scafs.append(scaf)
        scaf = set()
        scaf.add(lines[1])
    else:
        scaf.add(lines[1])

scafs.append(scafs)
del scafs[0]
del scafs[-1]

infile.close()

def consolidate(sets):
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]


for i in consolidate(scafs):
    for a in i:
        print(a)
    print("---")

Yes, I know it's still ugly looking code, but for now it does what I need it to do. Once I insert it into my program it will certainly look better.

How can I optimize set matching in this case using python?

1 Answers1