3

I am trying to iterate through every file in a specific directory (called sequences), and perform two functions on each file. I know that the functions (the 'blastp' and 'cat' lines) work, since I can run them on individual files. Ordinarily I would have a specific file name as the query, output, etc., but I'm trying to use a variable so the loop can work through many files.

(Disclaimer: I am new to coding.) I believe that I am running into serious problems with trying to use my file names within my functions. As it is, my code will execute, but it creates a bunch of extra unintended files. This is what I intend for my script to do:

Line 1: Iterate through every file in my "sequences" directory. (All of which end with ".fa", if that is helpful.)

Line 3: Recognize the filename as a variable. (I know, I know, I think I've done this horribly wrong.)

Line 4: Run the blastp function using the file name as the argument for the "query" flag, always use "database.faa" as the argument for the "db" flag, and output the result in a new file that is has the same name as the initial file, but with ".txt" at the end.

Line 5: Output parts of the output file from line 4 into a new file that has the same name as the initial file, but with "_top_hits.txt" at the end.

for sequence in ./sequences/{.,}*;
    do
            echo "$sequence";
            blastp -query $sequence -db database.faa -out ${sequence}.txt -evalue 1e-10 -outfmt 7
            cat ${sequence}.txt | awk '/hits found/{getline;print}' | grep -v "#">${sequence}_top_hits.txt
    done

When I ran this code, it gave me six new files derived from each file in the directory (and they were all in the same directory - I'd prefer to have them all in their own folders. How can I do that?). They were all empty. Their suffixes were, ".txt", ".txt.txt", ".txt_top_hits.txt", "_top_hits.txt", "_top_hits.txt.txt", and "_top_hits.txt_top_hits.txt".

If I can provide any further information to clarify anything, please let me know.

lynkyra
  • 59
  • 5
  • 2
    It looks like at least one of your problems is that you've tried to run the same function multiple times in the same directory. Each time you run it, I believe your loop finds new files that you've generated in a previous run and tries to operate on them also. As far as I can tell, you are not restricting your file search to files ending in `*.fa`, but I would recommend you do that. Otherwise you're going to keep processing your newly-outputted `.txt` files and generating more erroneous output. – aardvarkk Nov 23 '16 at 03:26
  • I agree, I do need to do that. I guess another way to solve that would be to make all my output files output to a separate directory. How would I make it only iterate through files ending in *.fa? Do I put that in line 1? – lynkyra Nov 23 '16 at 03:35

3 Answers3

3

If you're only interested in *.fa files I would limit your input to only those matching files like this:

for sequence in sequences/*.fa; do

aardvarkk
  • 14,955
  • 7
  • 67
  • 96
0

I can propose you the following improvements:

for fasta_file in ./sequences/*.fa # ";" is not necessary if you already have a new line for your "do"
do
    # ${variable%something} is the part of $variable
    # before the string "something"
    # basename path/to/file is the name of the file
    # without the full path
    # $(some command) allows you to use the result of the command as a string
    # Combining the above, we can form a string based on our fasta file
    # This string can be useful to name stuff in a clean manner later
    sequence_name=$(basename ${fasta_file%.fa})
    echo ${sequence_name}
    # Create a directory for the results for this sequence
    # -p option avoids a failure in case the directory already exists
    mkdir -p ${sequence_name}
    # Define the name of the file for the results
    # (including our previously created directory in its path)
    blast_results=${sequence_name}/${sequence_name}_blast.txt
    blastp -query ${fasta_file} -db database.faa \
        -out ${blast_results} \
        -evalue 1e-10 -outfmt 7
    # Define a file name for the top hits
    top_hits=${sequence_name}/${sequence_name}_top_hits.txt
    # alternatively, using "%"
    #top_hits=${blast_results%_blast.txt}_top_hits.txt
    # No need to cat: awk can take a file as argument
    awk '/hits found/{getline;print}' ${blast_results} \
        | grep -v "#" > ${sequence_name}_top_hits.txt
done

I made more intermediate variables, with (hopefully) meaningful names. I used \ to escape line ends and allow putting commands in several lines. I hope this improves code readability.

I haven't tested. There may be typos.

bli
  • 7,549
  • 7
  • 48
  • 94
0

You should be using *.fa if you only want files with a .fa ending. Additionally, if you want to redirect your output to new folders you need to create those directories somewhere using

mkdir 'folder_name'

then you need to redirect your -o outputs to those files, something like this

'command' -o /path/to/output/folder

To help you test this script out, you can run each line one by one to test them. You need to make sure each line works by itself before combining.

One last thing, be careful with your use of colons, it should look something like this:

for filename in *.fa; do 'command'; done