gene extraction from a fasta file according to sequence with biopython

Question

I have a fasta file that I am parsing. The file is composed of several sequences that belongs to the same gene from a different bacterial strain. What I want to do, and the script does is to check if the sequence are equal or different to the reference seq. With that information I want to produce a new file BUT I only one one sequence of each.

def checking_sequences():
gene_records=list(SeqIO.parse('/files/gene_A.fasta', 'fasta'))
ref_id=gene_records[-1].id
ref_seq=gene_records[-1].seq
#print gene_records[-1].description
output_handle=open('//files/' + 'corrected_gene_1', 'a')
print len(gene_records)
count=0
dif_count=0
reference_list=[]

for gene_record in gene_records:
    #count+=1
    if len(gene_record.seq) == len(ref_seq):
    #print len(gene_records.seq)
    #print len(ref_seq)
        print 'Found all lengths are equal'                     
    else:
        print 'Found %s sequence with different lengths' % (gene_records.description)

    ###checking sequence equality
    if gene_record.seq==ref_seq:
        count+=1
        gene_record.id=gene_record.id +'_0'
        reference_list.append(gene_record)
        ref_count=reference_list.count(gene_record.seq)
        print 'There are %i sequences are  equal to the reference sequence' %(count)    
    else:       
        dif_count+=1
        reference_list.append(gene_record.seq)
        seq_count=reference_list.count(gene_record.seq)
        gene_record.id=gene_record.id +'_'+ str(dif_count)
        print 'Found  %i  different that ref_seq' % (seq_count)
        print 'xxxxxxxxxxx'




        #print seq_count
        #print len(reference_list)  
    SeqIO.write(gene_record, output_handle, 'fasta')


output_handle.close()

checking_sequences() For some clarification :

original file                           desire output
    >gene_1 strainIDx                     >gen1_strainIDx
    seqA                                    seqA
    >gene_1 strainIDy                      >gene_1 strainIDy
   seqB                                       seqB
    >gene_1 strainIDz
    seqA

I don't mind about the ID just I would like to have one seq of each. I have tried to use 'break'but I don't get the output I would like to. Help will be appreciate

stovfl · Accepted Answer · 2017-09-13T10:11:02.910

0

Comment: I have never heard of hash so I didn't know what it does or doesn't.

Reference:

Built-in Functions hash(object)
Return the hash value as integers.
They are used to quickly compare dictionary keys during a dictionary lookup.

SO QA Built in python hash() function

Question: I want to produce a new file BUT only one sequence of each.

Use a Lookup Table, for example:

lookup = Set()

with open('/files/' + 'corrected_gene_1', "w") as handle: 
    for record in SeqIO.parse('/files/gene_A.fasta', "fasta"):
        seq_hash = hash(str(record.seq))

        if not seq_hash in lookup:
            # Not in lookup, save
            lookup.add(seq_hash)
            SeqIO.write(records, handle, "fasta")
        else:
            # already saved - Skip
            pass

edited Sep 13 '17 at 10:11

answered Sep 12 '17 at 13:46

stovfl

14,998
7
24
51

And in wihich part of my script I should place it, please? – Ana Sep 12 '17 at 15:01
I don't understand your code very much. Before I create my file I firstly need to see if len(seq) are equal. If there are equal then I add them to the file. Now I'm very puzzled. – Ana Sep 12 '17 at 16:13
1

@Ana: Equal Length don't mean equal sequence, but equal `hash` does. My Example code **filters** all duplicate sequence. If you need another Condition [Edit] your Question accordingly. – stovfl Sep 12 '17 at 17:05
@stovfl Why adding an `else` statement? seems unnecessary here – rodgdor Sep 12 '17 at 17:57
@rodgdor: Remove it if don't need it. You can use it for Statistics e.g. count, print or write to another File duplicate sequence. – stovfl Sep 12 '17 at 18:36
@stovfl: I have never heard of hash so I didn't know what it does or doesn't. But if it short my code, I will be very happy using it. I am going to get some reading about it. If you know a nice link just let me know. One is always learning new things! Thanks. – Ana Sep 13 '17 at 08:01

gene extraction from a fasta file according to sequence with biopython

1 Answers1