I wrote this to retrieve the FASTA sequences for a list of protein accession numbers (acc_list.txt, one accession per line) and write them to a text file (prot_list.txt).
from Bio import Entrez

x = 0
# First pass: count the accessions so progress can be reported.
with open("acc_list.txt", "r") as input:
    number = sum(1 for items in input) ###
# Second pass: fetch each record and write it to the output file.
with open("acc_list.txt", "r") as input:
    with open("prot_list.txt", "w") as output:
        for acc in input:
            handle = Entrez.efetch(db="protein", id=acc, rettype="fasta")
            x += 1
            print("Dealing with", str(acc.strip()), str(x), "out of", str(number), sep=" ")
            output.write(handle.read())
It is a big list, so the penultimate line (the print call) gives me an idea of the progress.
As you can see, number = sum(1 for items in input) gives the total number of lines, but I have to open and close the file separately for it, because if I put that line under the latter with statement, i.e.
x = 0
with open("acc_list.txt", "r") as input:
    with open("prot_list.txt", "w") as output:
        for acc in input:
            number = sum(1 for items in input) ###
            handle = Entrez.efetch(db="protein", id=acc, rettype="fasta")
            x += 1
            print("Dealing with", str(acc.strip()), str(x), "out of", str(number), sep=" ")
            output.write(handle.read())
it stops after counting items and gives no other outputs.
I'm guessing this is because number = sum(1 for items in input) consumes the rest of the file iterator, so the outer for loop has nothing left to read and ends as well.
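Here is a minimal sketch of the behaviour I think I am seeing. It uses io.StringIO as a stand-in for the real file, and the accession strings are made-up placeholders, not real records:

```python
import io

# Stand-in for acc_list.txt; the accession strings are placeholders.
fake_file = io.StringIO("ACC_1\nACC_2\nACC_3\n")

seen = []
for acc in fake_file:
    # Counting here pulls every remaining line out of the same iterator,
    # so the outer loop finds nothing left on its next step.
    remaining = sum(1 for _ in fake_file)
    seen.append((acc.strip(), remaining))

print(seen)  # [('ACC_1', 2)]
```

Only the first accession is ever processed, and the count is one short because the first line was already taken by the for loop.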
I am curious as to whether there is a more efficient way to obtain the number of lines in a file? I can imagine that if I work with an even bigger list, there may be problems with my approach. I've seen older answers and they all involve iterating through the file first.
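One thing I could try is to read the accessions into a list once and take len() of it, so the file is opened a single time. This is only a sketch: it still iterates through the file, it assumes the list fits in memory, and the helper names load_accessions and progress_messages are my own, not from any library. It also skips blank lines, which my original count did not. The demo writes a small temporary file with placeholder accessions in place of acc_list.txt and leaves out the Entrez call.

```python
import os
import tempfile


def load_accessions(path):
    """Read the file once and return the stripped, non-empty lines."""
    with open(path, "r") as handle:
        return [line.strip() for line in handle if line.strip()]


def progress_messages(accessions):
    """Yield one progress string per accession, numbered from 1."""
    total = len(accessions)  # no second pass over the file is needed
    for position, acc in enumerate(accessions, start=1):
        yield f"Dealing with {acc} {position} out of {total}"


# Demo with a temporary file standing in for acc_list.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ACC_1\nACC_2\n\nACC_3\n")
    demo_path = tmp.name

demo_accessions = load_accessions(demo_path)
demo_messages = list(progress_messages(demo_accessions))
os.remove(demo_path)

for message in demo_messages:
    print(message)
```

In the real script the Entrez.efetch call and output.write would go inside the enumerate loop, in place of the yield.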