
I have a directory with multiple FASTA files named as follows:

BC-1_bin_1_genes.faa
BC-1_bin_2_genes.faa
BC-1_bin_3_genes.faa
BC-1_bin_4_genes.faa

etc. (about 200 individual files)

The FASTA headers look like this:

>BC-1_k127_3926653_6 # 4457 # 5341 # -1 # ID=2_6;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.697

I now want to add the filename to the header since I want to annotate the sequences for each file. I tried the following:

for file in *.faa;
   do
       sed -i "s/>.*/${file%%.*}/" "$file" ;
done 

It worked partially, but it removed the ">" from the header, which is essential for the FASTA file. I tried to modify the "${file%%.*}" part to keep the carrot, but it always complained about a bad substitution.

I also tried this:

awk '/>/{sub(">","&"FILENAME"_");sub(/\.faa/,x)}1' *.faa

This worked in theory, but it only printed everything to my terminal rather than changing the respective files.

Could someone assist with this?

Marlene
  • What would be the expected _new_ header in your example? – Fravadona Mar 23 '22 at 14:20
  • Ideally it would be e.g. for the first one >BC-1_bin_1_k127_3926653_6 # 4457 # 5341 # -1 # ID=2_6;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.697 and so on, but I don't necessarily need the "# 4457 # 5341 # -1 # ID=2_6;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.697" part to be retained. – Marlene Mar 23 '22 at 14:23
  • #1 Do all the files have the same **>BC-1_k127_3926653_6**, and do you want to replace it with the real file name? #2 Is the pattern the text between > and the first #? – JRichardsz Mar 23 '22 at 15:15

3 Answers


It's not clear whether you want to replace the earlier header, or add to it. Both scenarios are easy to do. Don't replace text you don't want to replace.

for file in ./*.faa;
do
    name=${file##*/}    # drop the leading ./ (otherwise %%.* would strip the whole string)
    sed -i "s/^>.*/>${name%%.*}/" "$file"
done

will replace the header, but include a leading > in the replacement, effectively preserving it; and

for file in ./*.faa;
do
    name=${file##*/}    # drop the leading ./ (otherwise %%.* would strip the whole string)
    sed -i "s/^>.*/& ${name%%.*}/" "$file"
done

will append the file name at the end of the header (& in the replacement string evaluates to the string we are replacing, again effectively preserving it).
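
To see what & does without touching any files, here is a minimal sketch that runs the same kind of substitution on a made-up two-line record read from standard input. The record contents and the name BC-1_bin_1_genes are placeholders for illustration, not taken from a real file.

```shell
# Feed a hypothetical header line and sequence line to sed.
# "&" in the replacement stands for whatever the pattern matched,
# so the header is kept and the name is appended after a space.
header_demo=$(printf '%s\n' '>BC-1_k127_3926653_6 # 4457' 'MKVLAA' |
    sed 's/^>.*/& BC-1_bin_1_genes/')
printf '%s\n' "$header_demo"
# prints:
# >BC-1_k127_3926653_6 # 4457 BC-1_bin_1_genes
# MKVLAA
```

The sequence line is left alone because it does not match ^>.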

For another variation, try

for file in *.faa;
do
    sed -i "/^>/s/\$/ ${file%%.*}/" "$file"
done

which says: on lines that match the regex ^>, replace the empty string at the end of the line ($) with a space and the file name.

Of course, your Awk script could easily be fixed, too. Standard Awk does not have an option to parallel the -i "in-place" option of sed, but you can easily use a temporary file:


for file in ./*.faa;
do
    awk '/^>/{ name = FILENAME; sub(/^\.\//, "", name); sub(/\.faa$/, "", name); $0 = $0 " " name }1' "$file" >"$file.tmp" &&
    mv "$file.tmp" "$file"
done

GNU Awk also has an -i inplace extension which you could simply add to the options of your existing script if you have GNU Awk.
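
A rough sketch of that GNU Awk route, assuming gawk 4.1 or later (earlier versions do not ship the inplace extension). The function name append_name_inplace is made up for this example.

```shell
# Append the extension-less file name to every header line, editing in place.
# Requires GNU Awk; plain awk or mawk will reject "-i inplace".
append_name_inplace() {
    [ "$#" -gt 0 ] || return 0    # nothing to do without file arguments
    gawk -i inplace '
        # At the first line of each file, derive the name to append.
        FNR == 1 { name = FILENAME; sub(/^.*\//, "", name); sub(/\.faa$/, "", name) }
        /^>/     { $0 = $0 " " name }   # header lines only
        { print }                       # with inplace, print writes back to the file
    ' "$@"
}
```

It could then be called as append_name_inplace ./*.faa from the directory that holds the files.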

Since FASTA files typically contain multiple headers, adding to the header rather than replacing all headers in a file with the same string seems more useful, so I changed your Awk script to do that instead.
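
If you want exactly the header from your comment (>BC-1_bin_1_k127_3926653_6 ...), something along these lines might do it. This is only a sketch under assumptions the question does not confirm: every file is named <sample>_bin_<n>_genes.faa, every header starts with > followed by <sample>_, and the names contain no characters that are special to sed. The function name prefix_headers is hypothetical.

```shell
# Rewrite ">BC-1_k127_..." to ">BC-1_bin_1_k127_..." in BC-1_bin_1_genes.faa.
prefix_headers() {
    file=$1
    base=${file##*/}          # drop any leading directory
    bin=${base%_genes.faa}    # e.g. BC-1_bin_1
    sample=${bin%%_bin_*}     # e.g. BC-1 (assumed to match the header prefix)
    # Only lines starting with ">sample_" change; sequence lines pass through.
    sed "s/^>${sample}_/>${bin}_/" "$file" >"$file.tmp" &&
    mv "$file.tmp" "$file"
}

for file in ./*.faa; do
    [ -e "$file" ] || continue    # the glob matched nothing
    prefix_headers "$file"
done
```

This uses a temporary file rather than sed -i, so it does not depend on GNU sed. Each header keeps its own sequence identifier, so headers within a file stay distinct.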

For what it's worth, the name of the character ^ is caret (a carrot is the vegetable). The character > is called greater than or right angle bracket, or right broket, or sometimes just wedge.

tripleee

You just need to detect the pattern to replace and use regex to implement it:

fasta_helper.sh

location=$1

for file in "$location"/*.faa
do
    full_filename=${file##*/}
    filename="${full_filename%.*}"
    # escape special chars
    filename=$(echo "$filename" | sed 's_/_\\/_g')
    echo "adding file name: $filename to: $full_filename"
    # restrict to header lines; without /^>/ the sequence lines would be overwritten too
    sed -i -E "/^>/ s/^[^#]+/>$filename /" "$location/$full_filename"
done

usage:

Just pass the folder with fasta files:

bash fasta_helper.sh /foo/bar


JRichardsz

Locating your files

I suggest first identifying your files with the find or ls command.

  find . -type f -name "*.faa" -printf "%f\n"

A find command that prints only the names of files with the extension .faa, including files in subdirectories of the current directory.

  ls -1d ./*.faa

An ls command that lists the files and directories ending in .faa in the current directory. The glob must stay unquoted so the shell can expand it.

Processing your files

Once you have the correct file list, iterate over it and apply the sed command.

  for fileName in $(find . -type f -name "*.faa" -printf "%f\n"); do
    # %f prints only the base name, so this only works for files in the current directory
    stripedFileName=${fileName%.faa} # strip extension .faa
    sed -i "1s|\$| $stripedFileName|" "$fileName" # append value of stripedFileName at end of line 1
  done
Dudi Boy
  • This only manipulates the first line of each file. FASTA files commonly contain multiple sequences, each with its own header. – tripleee Mar 26 '22 at 13:33
  • `find` traverses all subdirectories. Generally [don't use `ls` in scripts.](http://mywiki.wooledge.org/ParsingLs) Also avoid parsing the output from `find` like this. The simple and obvious way to loop over all `.faa` files in the current directory is simply `for fileName in ./*.faa; do`... – tripleee Mar 26 '22 at 13:46
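
For anyone who does need the subdirectories, the points in the two comments above can be combined roughly as follows: let find run the commands itself through -exec instead of parsing its output, and address every header line rather than only line 1. This is a sketch, not a tested production script. The function name append_name_recursive is made up, and it assumes file names contain no characters that are special in a sed replacement (/, & or backslash).

```shell
# Append the extension-less file name to every header of every .faa file
# under the given directory (default: current directory).
append_name_recursive() {
    find "${1:-.}" -type f -name '*.faa' -exec sh -c '
        for path do
            name=${path##*/}      # base name without the directory part
            name=${name%.faa}     # strip the extension
            # /^>/ selects header lines; s/$/ .../ appends at the end of each
            sed "/^>/s/\$/ $name/" "$path" >"$path.tmp" &&
            mv "$path.tmp" "$path"
        done
    ' sh {} +
}
```

Called as append_name_recursive /foo/bar, it passes the matching paths to a small inline sh script as arguments, so names with spaces survive intact.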