I wrote this to retrieve FASTA sequences of a list of protein accession numbers (acc_list.txt), each on a new line, and write them to a txt file (prot_list).

from Bio import Entrez

x=0
with open("acc_list.txt","r") as input:
    number = sum(1 for items in input) ###
with open("acc_list.txt","r") as input:
    with open ("prot_list.txt","w") as output:
        for acc in input:
            acc = acc.strip()  # drop the trailing newline before sending the ID
            handle = Entrez.efetch(db="protein",id=acc,rettype="fasta")
            x+=1
            print("Dealing with", acc, x, "out of", number)
            output.write(handle.read())

It is a big list so the penultimate line gives me an idea of the progress.

As you can see, number = sum(1 for items in input) gives the total number of lines, but I have to open and close the file separately for it, because if I put that line under the latter with statement, i.e.

x=0
with open("acc_list.txt","r") as input:
    with open ("prot_list.txt","w") as output:
        for acc in input:
            number = sum(1 for items in input) ###
            handle = Entrez.efetch(db="protein",id=acc,rettype="fasta")
            x+=1
            print("Dealing with", str(acc.strip()), str(x), "out of", str(number), sep=" ")
            output.write(handle.read())

it stops after counting items and gives no other outputs. I'm guessing this is because number = sum(1 for items in input) iterates through the file and ends the iteration too.
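
That guess is right: a file object is its own iterator, so once sum(...) has consumed it, the outer for loop has nothing left to read. A minimal sketch of one way to avoid reopening the file is to rewind it with seek(0) after counting. The helper name count_then_iterate and the demo accession numbers are made up for illustration, and the Entrez call is left out so the snippet runs offline.

```python
import os
import tempfile


def count_then_iterate(path):
    """Count the lines in `path`, then rewind and yield (index, total, line)."""
    with open(path, "r") as acc_file:
        total = sum(1 for _ in acc_file)  # consumes the iterator to the end
        acc_file.seek(0)                  # rewind so the file can be read again
        for index, line in enumerate(acc_file, start=1):
            yield index, total, line.strip()


if __name__ == "__main__":
    # Hypothetical accession numbers, only for the demo.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("ACC_1\nACC_2\nACC_3\n")
    try:
        for index, total, acc in count_then_iterate(tmp.name):
            print("Dealing with", acc, index, "out of", total)
    finally:
        os.remove(tmp.name)
```

This still reads the file twice; it only saves the second open call, so it is a convenience rather than a speed-up.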

I am curious as to whether there is a more efficient way to obtain the number of lines in a file? I can imagine that if I work with an even bigger list, there may be problems with my approach. I've seen older answers and they all involve iterating through the file first.

endo.anaconda
KcsA7466
  • `1 + input.read().count('\n')` also works, though only timing will show if it is any faster. – John Coleman Jun 05 '21 at 13:52
  • Thanks! I tried that, but it still needs to be put under its own `with` statement. I guess there is no way around it... – KcsA7466 Jun 05 '21 at 19:37
  • I assume you can't use `with open("acc_list.txt","r") as input: inputz = input.readlines()` and work on `inputz`, because you don't want to load the entire list at once, right? – pippo1980 Jun 07 '21 at 10:08
  • @KcsA7466 if you find any of the answers helpful, please consider accepting and/or upvoting them – Jan Wilamowski Jun 23 '21 at 05:57
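
The readlines() idea from the comments can be sketched as follows, assuming the accession list is small enough to hold in memory. Reading it once gives both the total and the items to loop over, and blank lines can be dropped in the same pass. load_accessions and progress_messages are hypothetical helper names, not part of the original code.

```python
def load_accessions(path):
    """Return the non-blank lines of `path` with surrounding whitespace removed."""
    with open(path, "r") as acc_file:
        return [line.strip() for line in acc_file if line.strip()]


def progress_messages(accessions):
    """Yield one progress string per accession, numbered from 1."""
    total = len(accessions)
    for index, acc in enumerate(accessions, start=1):
        yield f"Dealing with {acc} {index} out of {total}"
```

In the original loop, the Entrez.efetch call would go inside the for loop over the loaded list, with len() of that list replacing the separate counting pass.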

2 Answers

Copying from the question "Is there a way to shallow copy an existing file-object?"

I've ended up with:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun  7 11:40:04 2021

@author: Pietro


https://stackoverflow.com/questions/67850117/more-streamline-way-to-count-lines-in-file-python

"""


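from itertools import tee

from Bio import Entrez

x = 0

with open("acc_list.txt", "r") as input:
    with open("prot_list.txt", "w") as output:
        # tee() gives two independent iterators over the same file object;
        # note that it buffers every line until input1 has caught up
        input1, input2 = tee(input, 2)

        # count only non-blank lines, so a trailing empty line is not counted
        number = sum(1 for items in input2 if items.strip())
        print(number)

        for acc in input1:
            acc = acc.strip()
            if acc != '':
                try:
                    handle = Entrez.efetch(db="protein", id=acc, rettype="fasta")
                    x += 1
                    print("Dealing with", acc, x, "out of", number)
                    output.write(handle.read())
                except Exception as err:
                    # report the failure instead of silently skipping it
                    print("Failed to fetch", acc, err)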

I'm not sure whether it is faster or whether it is what you were looking for; let us know.

Besides that, I noticed that the empty line at the end of my acc_list.txt file is always read as an empty accession number, so I found a somewhat elaborate way to suppress it.

pippo1980

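Instead of counting yourself, you can let existing tools like grep do the job:

import subprocess

# -c prints the number of matching lines; the pattern '.' matches every non-empty line.
# acc_list.txt holds bare accession numbers, so it contains no '>' characters to count
# (for a FASTA file you would count the '>' header lines instead).
# grep exits with status 1 when nothing matches, which check=True turns into an exception.
p = subprocess.run(['grep', '-c', '.', 'acc_list.txt'], check=True, capture_output=True, text=True)
seq_count = int(p.stdout)

In my tests, this was faster than opening and counting in Python, especially for larger files. Counting matching lines instead of line breaks also saves you from issues when the last line doesn't include a \n.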
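
If calling an external program is not an option (for example on a system without grep), a pure-Python count that skips decoding and line splitting is a common alternative. The sketch below reads the file in binary chunks and counts newline bytes, adding one when the last line lacks a trailing newline. count_lines and the chunk size are illustrative choices, and whether it beats grep or sum(1 for ...) would need timing on your own data.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count lines in `path`, including a final line without a trailing newline."""
    newlines = 0
    last_byte = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A non-empty file whose last byte is not "\n" ends with an unterminated line.
    if last_byte and last_byte != b"\n":
        newlines += 1
    return newlines
```

Note that this counts blank lines as lines, so an empty line at the end of the list is included in the total.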

Jan Wilamowski