1

I am making a .fasta parser script. (I know some parsers already exist for .fasta files but I need practice dealing with large files and thought this was a good start).

The goal of the program: take a very large .fasta file with multiple sequences, and return the Reverse compliment of each sequence in a new file.

I have a script that reads it one line at a time but each line is only about 50 bytes in a regular .fasta file. So buffering only one line at a time is not necessary. Is there a way I can set the amount of lines to buffer at a time?

For people who don't know what fasta is: A .fasta file is a text file with DNA sequences each with a header line before the DNA/RNA/Protein sequence marked by a '>'. Example:

>Sequence_1
ATG
TATA
>Sequence_2
TATA
GACT
ATG

Overview of my code:

Read the first byte of each line to Map the location of the sequences in the file by finding the '>'s.

Use this map to read each sequence backwards line by line (this is what I want to change to be able to)

  • Reverse
  • compliment the base pairs in the string (I.E. G->C)
  • Write this line to a new file

Here is what should be all the relevant code so you can see the functions the lines I am trying to change are calling: (can probably skip)

def get_fasta_seqs(BIGFILE): #Returns returns an object containing all of seq locations
  f = open(BIGFILE)
  cnt = 0
  seq_name = []
  seq_start = []
  seq_end = []
  seqcount = 0
  #print(line)
  #for loop skips first line check for seq name
  if f.readline(1) == '>':
    seq_name.append(cnt)
    seq_start.append(cnt+1)
    seqcount =+ 1
  for line in f:
    cnt += 1
    if f.readline(1) == '>':
      seq_name.append(cnt)
      seq_start.append(cnt+1)
      seqcount += 1
      if seqcount > 1:
        seq_end.append(cnt-1)
  seq_end.append(cnt-1) #add location of final line

  seqs = fileseq(seq_name,seq_start,seq_end,seqcount) #This class only has a __init__ function for these lists
  return seqs

def fasta_rev_compliment(fasta_read,fasta_write = "default",NTtype = "DNA"):
  if fasta_write == 'default':
    fasta_write = fasta_read[:-6] + "_RC.fasta"
  seq_map = get_fasta_seqs(fasta_read)
  print(seq_map.seq_name)
  f = open(fasta_write,'a')
  for i in range(seq_map.seqcount): #THIS IS WHAT I WANT TO CHANGE
    line = getline(fasta_read,seq_map.seq_name[i]+1) #getline is reading it as 1 indexed?
    f.write(line)
    my_fasta_seqs = get_fasta_seqs(fasta_read)
    for seqline in reversed(range(seq_map.seq_start[i],seq_map.seq_end[i]+1)):
      seq = getline(fasta_read,seqline+1)
      seq = seq.replace('\n','')
      seq = reverse_compliment(seq,NTtype = NTtype) #this function just returns the reverse compliment for that line.
      seq = seq + '\n'
      f.write(seq)
  f.close()

fasta_rev_compliment('BIGFILE.fasta')

The main bit of code I want to change is here:

for i in range(seq_map.seqcount): #THIS IS WHAT I WANT TO CHANGE
  line = getline(fasta_read,seq_map.seq_name[i]+1) #getline is reading it as 1 indexed?
  f.write(line)
  my_fasta_seqs = get_fasta_seqs(fasta_read)
  for seqline in reversed(range(seq_map.seq_start[i],seq_map.seq_end[i]+1)):
    seq = getline(fasta_read,seqline+1)

I want something like this:

def fasta_rev_compliment(fasta_read,fasta_write = "default",NTtype = "DNA",lines_to_record_before_flushing = 5):
  ###MORE CODE###
  #i want something like this

  for i in range(seq_map.seqcount): #THIS IS WHAT I WANT TO CHANGE
    #is their a way to load
    line = getline(fasta_read,seq_map.seq_name[i]+1) #getline is reading it as 1 indexed?
    f.write(line)
    my_fasta_seqs = get_fasta_seqs(fasta_read)
    for seqline in reversed(range(seq_map.seq_start[i],seq_map.seq_end[i]+1)):
      seq = getline(fasta_read,seqline+1)
    #Repeat n = 5 (or other specified number) times until flushing ram.

The problem I am running into is the fact that I need to read the file backwards. All of the methods I can find don't work well when you try to apply it to reading the file backwards. Is there something that can read a file in chunks but backwards?

Or: Anything else that can make this more effective for a low memory setup. Right now it uses hardly any memory, but takes 21secs for a 100kB file with about 12,000 lines, but processes the file instantly using the file.readlines() method.

martineau
  • 119,623
  • 25
  • 170
  • 301
  • please show your desired output. – acushner Jul 10 '20 at 19:54
  • For this question it is helpful if you could provide a somewhat larger section of the input file. What is 'a low memory' setup, how much memory is available? And how large is a 'large' file? Is it absolutely impossible to read it into memory al at once? – Ronald Jul 10 '20 at 19:58
  • 1
    Probably not the complete solution you are looking for, but it may put you on the right track: [How to read a file in reverse order](https://stackoverflow.com/questions/2301789/how-to-read-a-file-in-reverse-order). You want to have look at the second answer which doesn't read the complete file into memory first. – Ronald Jul 10 '20 at 20:19
  • @Ronald It depends on the file. Many I deal with are above 20GBs. for the smaller ones I have another script that works great – David William Turnell Jul 10 '20 at 22:45

1 Answers1

0

Here is an example of obtaining the reverse complement of a fasta file. Perhaps you can use some ideas from this.

import re

file = """\
>Sequence_1  
ATG
TATA
>Sequence_2 
TATA
GACT
ATG""".splitlines()

s = ''
for line in file:
    line = line.rstrip()
    if line.startswith('>'):
        if len(s):
            # complement the sequence of fasta 'TAGC' to 'ATCG'
            # T to A, A to T, G to C, C to G
            s = s.translate(str.maketrans('TAGC', 'ATCG'))
            # reverse the string, 's[::-1]'
            # Also, print up to 50 fasta per line to the end of the sequence
            s = re.sub(r'(.{1,50})', r'\1\n', s[::-1])
            print(s, end='')
            s = ''
        print(line)
    else:
        s += line

# print last sequence
s = s.translate(str.maketrans('TAGC', 'ATCG'))
s = re.sub(r'(.{1,50})', r'\1\n', s[::-1])
print(s, end='')

Prints:

>Sequence_1  
TATACAT
>Sequence_2 
CATAGTCTATA
Chris Charley
  • 6,403
  • 2
  • 24
  • 26