1

I have over 14000 fasta files, and I want to keep only the ones containing 5 sequences. I know I can use the following bash command to obtain the number of sequences in a single fasta file:

grep -c "^>" filename.fasta

So my approach was to write the the filename and count of sequences in each file to a text file, which I could then use to isolate only the sequences I want. To run the grep command on so many files, I am using subprocess.call:

import subprocess
import os


with open("five_seqs.txt", "w") as f:
    for file in os.listdir("/Users/vivaksoni1/Downloads/DA_CDS/fasta_files"):
        f.write(file),
        subprocess.call(["grep", "-c", "^>", file], stdout = f)

Part of my problem is that the grep command is "^>", but subprocess requires each argument to have its own quotation marks. How can I use "^>" when I would essentially be entering as an argument: ""^>"".

Also, do I have to add f.write("\n") after f.write(file)? Currently my output is just a text file with each entry next to one another, and the subprocess command just prints each file name to the terminal and states no file found as such:

grep: MZ23900789.fasta: No such file or directory

spiral01
  • 545
  • 2
  • 17
  • have you tried: `shell=True` on the `subprocess.call()`? Example: `subprocess.call(["grep", "-c", "^>", file], stdout=f, shell=True)` – Javier Buzzi Apr 25 '16 at 13:10
  • Hi, I did try this with no success unfortunately. The grep command is still not written to the file, and I'm getting this output to the terminal for each file: usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]] [-e pattern] [-f file] [--binary-files=value] [--color=when] [--context[=num]] [--directories=action] [--label] [--line-buffered] [--null] [pattern] [file ...] – spiral01 Apr 25 '16 at 13:16
  • get a file, any just to test out: `grep -c '^>' fasta_file`.. does it work if it does, then try: `subprocess.call(["grep", "-c", "'^>'", file], stdout=f, shell=True)` otherwise something else is wrong, dissect and test the call at every turn. `pdb` is your friend -- `ipdb` is your BEST friend – Javier Buzzi Apr 25 '16 at 13:20
  • Quote the argument inside the python quotes. `'"^>"'` or `"'^>'"`. – Etan Reisner Apr 25 '16 at 14:16
  • I have tried quoting the arguments inside the python quotes but get the same terminal response as before. Having gone through my code line by line I have still not been able to decipher the root of the problem. – spiral01 Apr 25 '16 at 14:41
  • `"^>"` is a non-issue. It works as is. There are other issues: 1- `os.listdir(path)` returns filenames without the corresponding `path` (the reason for "No such file or directory" -- use `os.path.join` to prepend the `path`) 2- you should be careful while [mixing buffered `file.write` and unbuffered file descriptor level `stdout=f` operations](http://stackoverflow.com/q/22417010/4279)-- call `file.flush()`. 3- and yes, If you want filenames to be on separate lines; something has to call `file.write('\n')`. – jfs Apr 26 '16 at 19:16

1 Answers1

2

Try the following code, it should work for your example. It will write the filename plus a tab separator and the number of sequences (i.e. > characters). Using Popen and communicate gives better flexibility in handling the output. Tested on Ubuntu.

import subprocess
import os

fasta_dir = "/Users/vivaksoni1/Downloads/DA_CDS/fasta_files/"

with open("five_seqs.txt", "w") as f:
    for file in os.listdir(fasta_dir):
        f.write(file + '\t')
        grep = subprocess.Popen(["grep", "-c", "^>", fasta_dir + file], stdout = subprocess.PIPE)
        out, err = grep.communicate()
        f.write(out + '\n')
Maximilian Peters
  • 30,348
  • 12
  • 86
  • 99