I have a text file with 2 columns of "scaffolds", like this:
scaffold1|size14662 scaffold1|size14662
scaffold1|size14662 scaffold2|size14565
scaffold1|size14662 scaffold111160|size1478
scaffold2|size14565 scaffold2|size14565
scaffold2|size14565 scaffold1|size14662
scaffold2|size14565 scaffold239623|size320
scaffold3|size14436 scaffold3|size14436
scaffold3|size14436 scaffold5|size13770
scaffold3|size14436 scaffold5|size13770
scaffold3|size14436 scaffold149|size9055
scaffold4|size14291 scaffold4|size14291
scaffold4|size14291 scaffold32275|size3028
scaffold4|size14291 scaffold66288|size2175
scaffold5|size13770 scaffold5|size13770
scaffold5|size13770 scaffold133|size9198
scaffold5|size13770 scaffold149|size9055
scaffold6|size13181 scaffold6|size13181
scaffold6|size13181 scaffold92|size9644
scaffold6|size13181 scaffold113496|size1447
scaffold7|size13167 scaffold7|size13167
The scaffolds in the right column are a "match" (as in "are the same thing") for the corresponding scaffold in the left column. For example, [scaffold1|size14662, scaffold2|size14565, scaffold111160|size1478] from the right column are all the same as scaffold1|size14662 from the left column.
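What I need from this file is a list (not a Python list, just a list) of sets of all the matching scaffolds. Matches chain: scaffold239623|size320 only appears next to scaffold2|size14565, but it should end up in the same set as scaffold1|size14662. The result should look like this: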
scaffold1|size14662
scaffold2|size14565
scaffold111160|size1478
scaffold239623|size320
---
scaffold7|size13167
---
scaffold5|size13770
scaffold3|size14436
scaffold149|size9055
scaffold133|size9198
---
scaffold92|size9644
scaffold113496|size1447
scaffold6|size13181
---
scaffold32275|size3028
scaffold66288|size2175
scaffold4|size14291
I was able to produce some code that does this, but it is extremely slow, because for every line of the file it loops over all the sets collected so far. Since I am working with a file of about 2 million lines, this is not a good solution.
    rawscafs = open("columnfile")
    scafs = {}
    for line in rawscafs:
        cont = 0
        splitvalues = line.split()
        # Scan every set collected so far (this is the slow part).
        for k, v in scafs.items():
            if splitvalues[1] in v:
                cont = 1
            elif splitvalues[0] in v:
                scafs[k].add(splitvalues[1])
                cont = 1
        # The pair was already handled by an existing set.
        if cont == 1:
            cont = 0
            continue
        if splitvalues[0] in scafs:
            scafs[splitvalues[0]].add(splitvalues[1])
        else:
            scafs[splitvalues[0]] = set()
            scafs[splitvalues[0]].add(splitvalues[1])
    rawscafs.close()
    for key in scafs:
        for i in scafs[key]:
            print(i+"\n")
        print("---\n")
As you can see, this is ugly code, but I was just looking for a quick and dirty solution, which I obviously have not found yet. Can anyone help me optimize this code, or suggest a simpler solution? I am sure there must be one; I just can't figure it out.
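For reference, the grouping described above amounts to finding connected components: each scaffold is a node, and each line of the file is an edge between two nodes. One common way to do that in a single pass is a union-find (disjoint-set) structure. The following is only a minimal sketch under that reading of the problem, not a tested replacement for the code above. The function name `group_scaffolds` is made up for illustration, and it assumes each useful line holds exactly two whitespace-separated names.

```python
from collections import defaultdict


def group_scaffolds(lines):
    """Return a list of sets; each set holds scaffolds that match transitively.

    `lines` is any iterable of strings with two whitespace-separated names
    (an assumption about the input format; other lines are skipped).
    """
    parent = {}  # maps each scaffold name to its parent in the union-find forest

    def find(x):
        # Register unseen names as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        left, right = fields
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two groups

    # Collect every name under the root of its group.
    groups = defaultdict(set)
    for name in parent:
        groups[find(name)].add(name)
    return list(groups.values())
```

A file object can be passed directly, for example `group_scaffolds(open("columnfile"))`, and each returned set can then be printed followed by a `---` separator. Each line is processed once and lookups are dictionary-based, so the per-line rescan of all existing sets is avoided. The order of the groups and of the names inside each group is not guaranteed to match the example output.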