I have over 14,000 FASTA files, and I want to keep only the ones containing 5 sequences. I know I can use the following bash command to get the number of sequences in a single FASTA file:
grep -c "^>" filename.fasta
So my approach was to write the filename and sequence count of each file to a text file, which I could then use to pick out only the files I want. To run the grep command on so many files, I am using subprocess.call:
import subprocess
import os

with open("five_seqs.txt", "w") as f:
    for file in os.listdir("/Users/vivaksoni1/Downloads/DA_CDS/fasta_files"):
        f.write(file),
        subprocess.call(["grep", "-c", "^>", file], stdout = f)
Part of my problem is that the grep pattern is "^>", but subprocess requires each argument to be its own quoted string. How can I pass "^>" when I would essentially be entering ""^>"" as the argument?
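For what it's worth, the quotes around ^> in the bash command are shell syntax rather than part of the pattern, and when subprocess is given a list (without shell=True) each list item is handed to the program as-is. The sketch below is only a way to check that: it uses a child Python process in place of grep, and echo_argument is a made-up helper name, not part of any library.

```python
import subprocess
import sys


def echo_argument(arg):
    """Run a child Python process and return the first argument it received.

    Hypothetical helper for illustration: the child stands in for grep so the
    check does not depend on grep being installed.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", arg],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise if the child exits with an error
    )
    return result.stdout.strip()


# The plain Python string "^>" reaches the child unchanged; no extra quoting is needed.
print(echo_argument("^>"))
```

If that holds for grep as well, then ["grep", "-c", "^>", path] should already pass the same pattern that the bash command does.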
Also, do I have to add f.write("\n") after f.write(file)? Currently my output is just a text file with the entries run together, and the subprocess command just prints each file name to the terminal and reports that the file was not found, like so:
grep: MZ23900789.fasta: No such file or directory
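One likely cause of that error is that os.listdir returns bare file names, so grep looks for them in the current working directory instead of in fasta_files. Below is a minimal sketch of the same bookkeeping done in pure Python (swapping grep for a line count, which is a substitution on my part). It assumes every record header starts with ">" and that the files end in ".fasta"; count_fasta_records and write_counts are hypothetical names.

```python
import os


def count_fasta_records(path):
    """Count lines starting with '>' (the same thing grep -c "^>" counts)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def write_counts(fasta_dir, out_path):
    """Write one 'filename<TAB>count' line per FASTA file in fasta_dir."""
    with open(out_path, "w") as out:
        for name in sorted(os.listdir(fasta_dir)):
            # Assumption: only files ending in .fasta are of interest.
            if not name.endswith(".fasta"):
                continue
            # os.listdir gives bare names, so join them with the directory.
            full_path = os.path.join(fasta_dir, name)
            if not os.path.isfile(full_path):
                continue
            count = count_fasta_records(full_path)
            # Explicit newline so each entry lands on its own line.
            out.write(f"{name}\t{count}\n")
```

Calling write_counts("/Users/vivaksoni1/Downloads/DA_CDS/fasta_files", "five_seqs.txt") should then give one line per file, which can be filtered for a count of 5.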