Create set from python loop output and remove duplicates

Question

I've researched the options already available on stack overflow, but none have helped given my lack of understanding of python. I have the following code, which gives the below output.

I would like to understand how to remove the duplicate lines from the output itself and save these to a file.

code:

product = contig = id = label = start = end = description = db_ref = feature_strand = cds_start = cds_end = 'NA'
with open(output.gff, "w") as ofile, open(input.gbk, "r+", encoding="utf-8") as gbkfile:
    for seq_record in SeqIO.parse(gbkfile, "genbank"):
        for feature in seq_record.features:
            if feature.type=='region':
                product=feature.qualifiers['product']
            if feature.type=='PFAM_domain':
                contig=seq_record.description
                id=seq_record.id
                label=feature.qualifiers['label']
                start=int(feature.location.start)+1
                end=int(feature.location.end)
                description=feature.qualifiers['description']
                db_ref=feature.qualifiers['db_xref']
            ofile.write("{}\t{}\t{}\t{}\t{}\t{}\t{}\tID={}\n".format(contig, label, start, end, product, description, db_ref, id))

output:

GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_3'] 2215 2487 ['terpene'] ['Glycosyl transferase family 2'] ['PF00535.26']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_3'] 2215 2487 ['terpene'] ['Glycosyl transferase family 2'] ['PF00535.26']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_3'] 2215 2487 ['terpene'] ['Glycosyl transferase family 2'] ['PF00535.26']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_3'] 2215 2487 ['terpene'] ['Glycosyl transferase family 2'] ['PF00535.26']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_3'] 2215 2487 ['terpene'] ['Glycosyl transferase family 2'] ['PF00535.26']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_1'] 58 705 ['terpene'] ['Lycopene cyclase protein'] ['PF05834.12']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_1'] 58 705 ['terpene'] ['Lycopene cyclase protein'] ['PF05834.12']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_1'] 58 705 ['terpene'] ['Lycopene cyclase protein'] ['PF05834.12']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_2'] 1054 1503 ['terpene'] ['Beta-lactamase'] ['PF00144.24']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_2'] 1054 1503 ['terpene'] ['Beta-lactamase'] ['PF00144.24']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_2'] 1054 1503 ['terpene'] ['Beta-lactamase'] ['PF00144.24']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_2'] 1054 1503 ['terpene'] ['Beta-lactamase'] ['PF00144.24']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_2'] 1054 1503 ['terpene'] ['Beta-lactamase'] ['PF00144.24']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_2'] 1054 1503 ['terpene'] ['Beta-lactamase'] ['PF00144.24']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_2'] 1054 1503 ['terpene'] ['Beta-lactamase'] ['PF00144.24']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_1'] 218 1225 ['terpene'] ['Flavin containing amine oxidoreductase'] ['PF01593.24', 'GO:0016491', 'GO:0055114']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_1'] 218 1225 ['terpene'] ['Flavin containing amine oxidoreductase'] ['PF01593.24', 'GO:0016491', 'GO:0055114']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_1'] 218 1225 ['terpene'] ['Flavin containing amine oxidoreductase'] ['PF01593.24', 'GO:0016491', 'GO:0055114']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_2'] 1346 2065 ['terpene'] ['Squalene/phytoene synthase'] ['PF00494.19']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_2'] 1346 2065 ['terpene'] ['Squalene/phytoene synthase'] ['PF00494.19']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_2'] 1346 2065 ['terpene'] ['Squalene/phytoene synthase'] ['PF00494.19']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_3'] 2215 2487 ['terpene'] ['Glycosyl transferase family 2'] ['PF00535.26']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_3'] 2215 2487 ['terpene'] ['Glycosyl transferase family 2'] ['PF00535.26']

Expected output:

GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_3'] 2215 2487 ['terpene'] ['Glycosyl transferase family 2'] ['PF00535.26']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_1'] 58 705 ['terpene'] ['Lycopene cyclase protein'] ['PF05834.12']
GL_R68_GL53_UP_2_contig_68150 c00202_GL_R68_.. ['ctg202_2'] 1054 1503 ['terpene'] ['Beta-lactamase'] ['PF00144.24']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_1'] 218 1225 ['terpene'] ['Flavin containing amine oxidoreductase'] ['PF01593.24', 'GO:0016491', 'GO:0055114']
GL_R68_GL53_UP_2_contig_311481 c00750_GL_R68_.. ['ctg750_2'] 1346 2065 ['terpene'] ['Squalene/phytoene synthase'] ['PF00494.19']

Thank you for your help!

Instead of writing to the output file one line at a time, construct a set of outputs. Once you've completed the outer loop, enumerate the set and write its contents to your output file. If order is important then you'll need to do something else such as building a list and checking if the output line already exists in the list. The first option will be faster but, as stated, if order is important then you'll need the second option. Also, don't use *id* as a variable name — DarkKnight, Oct 03 '22 at 07:02
you could use sets https://www.w3schools.com/python/python_sets.asp though you should change the position of where your write to file is placed as that will always right to the file the formatted things (which you don't want to do). you should add things to the set (to make sure there are not any duplicates) and then go through the set elements and output those to a file. — Andrew Ryan, Oct 03 '22 at 07:03

Alexander · Accepted Answer · 2022-10-03T07:48:08.257

2

You can collect lines that you have already written to a set, and then check that same set on each iteration to make sure you are not writing a duplicate line. This method is likely to require the least amount of changes to your code.

product = contig = id = label = start = end = description = db_ref = feature_strand = cds_start = cds_end = 'NA'
written = set()             # create an empty set
with open('output.gff', "w") as ofile, open('input.gbk', "r+", encoding="utf-8") as gbkfile:
    for seq_record in SeqIO.parse(gbkfile, "genbank"):
        for feature in seq_record.features:
            if feature.type=='region':
                product=feature.qualifiers['product']
            if feature.type=='PFAM_domain':
                contig=seq_record.description
                id=seq_record.id
                label=feature.qualifiers['label']
                start=int(feature.location.start)+1
                end=int(feature.location.end)
                description=feature.qualifiers['description']
                db_ref=feature.qualifiers['db_xref']
            output = "{}\t{}\t{}\t{}\t{}\t{}\t{}\tID={}\n".format(contig, label, start, end, product, description, db_ref, id)
            if output not in written:   # if output not in set
                written.add(output)      # add the output to the set
                ofile.write(output)     # then write the line

edited Oct 03 '22 at 07:48

answered Oct 03 '22 at 07:26

Alexander

16,091
5
13
29

`ofile` will be rewritten on each for loop iteration, and in the end there will be just one line in `ofile`, won't it? should we use flag `a` instead of `w` when we open `ofile`? – Dmitriy Neledva Oct 03 '22 at 07:31
@DmitriyNeledva no... that is not accurate. each subsequent call to write simply appends to the previous writes. It's only after closing the file and opening again in write mode that it will overwrite the file.... – Alexander Oct 03 '22 at 07:33
why you `write` `output` but not `written`? – Dmitriy Neledva Oct 03 '22 at 07:36
@DmitriyNeledva because written is just a collection of lines that have already been written. ouput is text result from the current iteration. – Alexander Oct 03 '22 at 07:41
1

Awesome, thank you very much. This works exactly how I need it to. – Susheel Busi Oct 03 '22 at 07:47
@SusheelBusi Sorry about that... my bad – Alexander Oct 03 '22 at 07:47
1

No worries.. it's a minor issue, but your help saved me plenty of time. Thank you again! – Susheel Busi Oct 03 '22 at 07:48

Dmitriy Neledva · Answer 2 · 2022-10-03T08:05:17.853

I've commented all alterations in the code

product = contig = id = label = start = end = description = db_ref = feature_strand = cds_start = cds_end = 'NA'
with open(output.gff, "w") as ofile, open(input.gbk, "r+", encoding="utf-8") as gbkfile:
    iterable_to_write = []  #this list will be written into 'ofile'
    for seq_record in SeqIO.parse(gbkfile, "genbank"):
        for feature in seq_record.features:
            if feature.type=='region':
                product=feature.qualifiers['product']
            if feature.type=='PFAM_domain':
                contig=seq_record.description
                id=seq_record.id
                label=feature.qualifiers['label']
                start=int(feature.location.start)+1
                end=int(feature.location.end)
                description=feature.qualifiers['description']
                db_ref=feature.qualifiers['db_xref']
            # here new items will be added to the list
    #         iterable_to_write.append("{}\t{}\t{}\t{}\t{}\t{}\t{}\tID={}\n".format(contig, label, start, end, product, description, db_ref, id))

                thing_to_add = "{}\t{}\t{}\t{}\t{}\t{}\t{}\tID={}\n".format(contig, label, start, end, product, description, db_ref, id)

                #if order is important
                if thing_to_add not in iterable_to_write:
                    iterable_to_write.append(thing_to_add)

    
    # iterable_to_write=set(iterable_to_write) #getting rid of dublicates

    #final writing 
    ofile.writelines(iterable_to_write) #it's iterable so there should be 'writelines()' but not 'write()'

No need to convert the set to a list because sets are iterable. Also, as I mentioned in an earlier comment, this may not work if order is important — DarkKnight, Oct 03 '22 at 07:34
this is the version without loosing order so we don't need set conversion and this lines `if [i for i in iterable_to_write if i ==thing_to_add] == []: iterable_to_write.append(thing_to_add)` create final list without duplicates — Dmitriy Neledva, Oct 03 '22 at 07:58

Create set from python loop output and remove duplicates

2 Answers2