I am making a .fasta parser script. (I know some parsers already exist for .fasta files but I need practice dealing with large files and thought this was a good start).
The goal of the program: take a very large .fasta
file with multiple sequences, and return the Reverse compliment of each sequence in a new file.
I have a script that reads it one line at a time but each line is only about 50 bytes in a regular .fasta file. So buffering only one line at a time is not necessary. Is there a way I can set the amount of lines to buffer at a time?
For people who don't know what fasta is: A .fasta
file is a text file with DNA sequences each with a header line before the DNA/RNA/Protein sequence marked by a '>'. Example:
>Sequence_1
ATG
TATA
>Sequence_2
TATA
GACT
ATG
Overview of my code:
Read the first byte of each line to Map the location of the sequences in the file by finding the '>'s.
Use this map to read each sequence backwards line by line (this is what I want to change to be able to)
- Reverse
- compliment the base pairs in the string (I.E. G->C)
- Write this line to a new file
Here is what should be all the relevant code so you can see the functions the lines I am trying to change are calling: (can probably skip)
def get_fasta_seqs(BIGFILE): #Returns returns an object containing all of seq locations
f = open(BIGFILE)
cnt = 0
seq_name = []
seq_start = []
seq_end = []
seqcount = 0
#print(line)
#for loop skips first line check for seq name
if f.readline(1) == '>':
seq_name.append(cnt)
seq_start.append(cnt+1)
seqcount =+ 1
for line in f:
cnt += 1
if f.readline(1) == '>':
seq_name.append(cnt)
seq_start.append(cnt+1)
seqcount += 1
if seqcount > 1:
seq_end.append(cnt-1)
seq_end.append(cnt-1) #add location of final line
seqs = fileseq(seq_name,seq_start,seq_end,seqcount) #This class only has a __init__ function for these lists
return seqs
def fasta_rev_compliment(fasta_read,fasta_write = "default",NTtype = "DNA"):
if fasta_write == 'default':
fasta_write = fasta_read[:-6] + "_RC.fasta"
seq_map = get_fasta_seqs(fasta_read)
print(seq_map.seq_name)
f = open(fasta_write,'a')
for i in range(seq_map.seqcount): #THIS IS WHAT I WANT TO CHANGE
line = getline(fasta_read,seq_map.seq_name[i]+1) #getline is reading it as 1 indexed?
f.write(line)
my_fasta_seqs = get_fasta_seqs(fasta_read)
for seqline in reversed(range(seq_map.seq_start[i],seq_map.seq_end[i]+1)):
seq = getline(fasta_read,seqline+1)
seq = seq.replace('\n','')
seq = reverse_compliment(seq,NTtype = NTtype) #this function just returns the reverse compliment for that line.
seq = seq + '\n'
f.write(seq)
f.close()
fasta_rev_compliment('BIGFILE.fasta')
The main bit of code I want to change is here:
for i in range(seq_map.seqcount): #THIS IS WHAT I WANT TO CHANGE
line = getline(fasta_read,seq_map.seq_name[i]+1) #getline is reading it as 1 indexed?
f.write(line)
my_fasta_seqs = get_fasta_seqs(fasta_read)
for seqline in reversed(range(seq_map.seq_start[i],seq_map.seq_end[i]+1)):
seq = getline(fasta_read,seqline+1)
I want something like this:
def fasta_rev_compliment(fasta_read,fasta_write = "default",NTtype = "DNA",lines_to_record_before_flushing = 5):
###MORE CODE###
#i want something like this
for i in range(seq_map.seqcount): #THIS IS WHAT I WANT TO CHANGE
#is their a way to load
line = getline(fasta_read,seq_map.seq_name[i]+1) #getline is reading it as 1 indexed?
f.write(line)
my_fasta_seqs = get_fasta_seqs(fasta_read)
for seqline in reversed(range(seq_map.seq_start[i],seq_map.seq_end[i]+1)):
seq = getline(fasta_read,seqline+1)
#Repeat n = 5 (or other specified number) times until flushing ram.
The problem I am running into is the fact that I need to read the file backwards. All of the methods I can find don't work well when you try to apply it to reading the file backwards. Is there something that can read a file in chunks but backwards?
Or: Anything else that can make this more effective for a low memory setup. Right now it uses hardly any memory, but takes 21secs for a 100kB file with about 12,000 lines, but processes the file instantly using the file.readlines()
method.