More streamlined way to count lines in a file (Python)

Question

I wrote this to retrieve FASTA sequences of a list of protein accession numbers (acc_list.txt), each on a new line, and write them to a txt file (prot_list).

x=0
with open("acc_list.txt","r") as input:
    number = sum(1 for items in input) ###
with open("acc_list.txt","r") as input:
    with open ("prot_list.txt","w") as output:
        for acc in input:
            handle = Entrez.efetch(db="protein",id=acc,rettype="fasta")
            x+=1
            print("Dealing with", str(acc.strip()), str(x), "out of", str(number), sep=" ")
            output.write(handle.read())

It is a big list so the penultimate line gives me an idea of the progress.

As you can see, number = sum(1 for items in input) gives the total number of lines, but I have to open and close the file separately, because if I put that line under the latter with statement, i.e.

x=0
with open("acc_list.txt","r") as input:
    with open ("prot_list.txt","w") as output:
        for acc in input:
            number = sum(1 for items in input) ###
            handle = Entrez.efetch(db="protein",id=acc,rettype="fasta")
            x+=1
            print("Dealing with", str(acc.strip()), str(x), "out of", str(number), sep=" ")
            output.write(handle.read())

it stops after counting items and gives no other outputs. I'm guessing this is because number = sum(1 for items in input) iterates through the file and ends the iteration too.

I am curious as to whether there is a more efficient way to obtain the number of lines in a file? I can imagine that if I work with an even bigger list, there may be problems with my approach. I've seen older answers and they all involve iterating through the file first.
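The guess above can be checked directly: a file object is its own iterator, so once sum(...) has read to the end, nothing is left for the for loop. A minimal sketch of one way to avoid the second open(), assuming the file is seekable (an ordinary file on disk is): count, rewind with seek(0), then iterate again. The helper name count_then_iterate and the sample accession values are made up for illustration, and the Entrez call is left out so the sketch runs offline.

```python
import tempfile
from pathlib import Path


def count_then_iterate(path):
    """Count the lines of `path`, rewind, then yield (index, total, line)."""
    with open(path, "r") as handle:
        total = sum(1 for _ in handle)  # reads the file to the end
        handle.seek(0)                  # rewind so the lines can be read again
        for index, line in enumerate(handle, start=1):
            yield index, total, line.strip()


# Demo with a throwaway file standing in for acc_list.txt (hypothetical values)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "acc_list.txt"
    demo.write_text("P12345\nQ67890\nO11111\n")
    progress = list(count_then_iterate(demo))

for index, total, acc in progress:
    print("Dealing with", acc, index, "out of", total)
```

This still reads the file twice, so it is tidier rather than faster; it only saves the second with statement.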

`1 + input.read().count('\n')` also works, though only timing will show if it is any faster. — John Coleman, Jun 05 '21 at 13:52
Thanks! I tried that, but it still needs to be put under its own `with` statement. I guess there is no way around it... — KcsA7466, Jun 05 '21 at 19:37
I assume you can't use with open("acc_list.txt","r") as input: inputz = input.readlines() and work on inputz , because you don't want to load the entire list at once, right ? — pippo1980, Jun 07 '21 at 10:08
@KcsA7466 if you find any of the answers helpful, please consider accepting and/or upvoting them — Jan Wilamowski, Jun 23 '21 at 05:57

pippo1980 · Answer 1 · 2021-06-07T11:44:21.003

Borrowing from the question "Is there a way to shallow copy an existing file-object?", I ended up with:

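#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun  7 11:40:04 2021

@author: Pietro

https://stackoverflow.com/questions/67850117/more-streamline-way-to-count-lines-in-file-python
"""

from itertools import tee

from Bio import Entrez

# NCBI asks for a contact address; set Entrez.email before calling efetch.

x = 0

with open("acc_list.txt", "r") as input:
    with open("prot_list.txt", "w") as output:
        # tee() gives two independent iterators over the same open file
        input1, input2 = tee(input, 2)

        # Count the non-empty lines. tee() buffers every line consumed here
        # until input1 catches up, so the whole list is held in memory.
        number = sum(1 for line in input2 if line.strip())
        print(number)

        for acc in input1:
            acc = acc.strip()
            if acc != '':  # skip blank lines, e.g. a trailing empty line
                try:
                    handle = Entrez.efetch(db="protein", id=acc, rettype="fasta")
                    x += 1
                    print("Dealing with", acc, x, "out of", number, sep=" ")
                    output.write(handle.read())
                except Exception as error:
                    # Report the failure instead of silently ignoring it
                    print("Failed to fetch", acc, error)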

I'm not sure whether it is faster, or whether it is what you were looking for; let us know.

Besides that, I noticed that the empty line at the end of my acc_list.txt file was being read as an empty accession number, so I found a somewhat elaborate way to suppress it.
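If the list fits in memory (tee() ends up buffering every line anyway when one copy is consumed completely before the other), a simpler sketch is to read the accession numbers into a list once, as suggested in the comments on the question. len() then gives the total and enumerate() the progress counter. read_accessions and the sample values are hypothetical names for illustration; the efetch call is omitted so the example runs without network access.

```python
import tempfile
from pathlib import Path


def read_accessions(path):
    """Return the stripped, non-empty lines of `path` as a list."""
    with open(path, "r") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo with a throwaway file; note the blank line at the end of the data
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "acc_list.txt"
    demo.write_text("P12345\nQ67890\nO11111\n\n")
    accessions = read_accessions(demo)

total = len(accessions)
for index, acc in enumerate(accessions, start=1):
    # The Entrez.efetch(...) call and output.write(...) would go here
    print("Dealing with", acc, index, "out of", total)
```

Blank lines are dropped while reading, so the total matches the number of requests actually made.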

oops .... it was me that inserted the empty line !!! scrap that bit — pippo1980, Jun 07 '21 at 11:48
Hi pippo, thank you for the suggestions. I've been very occupied and I'll take a look soon! — KcsA7466, Jun 09 '21 at 00:30

score 0 · Answer 2 · answered Jun 08 '21 at 07:11

Instead of counting yourself, you can let existing tools like grep do the job:

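import subprocess

# '.' matches every non-empty line, so -c counts the accession numbers
p = subprocess.run(['grep', '-c', '.', 'acc_list.txt'], check=True, capture_output=True, text=True)
seq_count = int(p.stdout)

In my tests, this was faster than opening and counting in Python, especially for larger files. Because grep counts matching lines rather than line breaks, it also saves you from issues when the last line doesn't include a \n. To count the records in a FASTA file such as prot_list.txt instead, use '>' as the pattern, since every FASTA header line starts with it.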
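Where grep is not available (on Windows, for example), a similar count can be done in pure Python. The following is only a sketch, not a benchmarked replacement: it reads the file in binary chunks, counts newline bytes, and adds one if the file ends without a trailing \n. The function name count_lines, the chunk size and the sample data are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def count_lines(path, chunk_size=1024 * 1024):
    """Count lines in `path`, including a final line with no trailing newline."""
    count = 0
    last_byte = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A non-empty file that does not end in "\n" has one more, unterminated line
    if last_byte and last_byte != b"\n":
        count += 1
    return count


# Demo: the same two accessions with and without a trailing newline
with tempfile.TemporaryDirectory() as tmp:
    with_newline = Path(tmp) / "with_newline.txt"
    without_newline = Path(tmp) / "without_newline.txt"
    empty = Path(tmp) / "empty.txt"
    with_newline.write_bytes(b"P12345\nQ67890\n")
    without_newline.write_bytes(b"P12345\nQ67890")
    empty.write_bytes(b"")
    counts = (count_lines(with_newline), count_lines(without_newline), count_lines(empty))

print(counts)
```

Unlike 1 + read().count('\n'), this does not overcount when the file does end with a newline, and it never holds more than one chunk in memory.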

Jan Wilamowski


Hi Jan, thank you for the suggestion. I've been very occupied and I'll take a look soon! – KcsA7466 Jun 09 '21 at 00:31
