
I want to replace every header line (those starting with >) with >{filename} in all *.fasta files in my directory, and then concatenate the files.

Contents of my directory:

speciesA.fasta
speciesB.fasta
speciesC.fasta

Example file, speciesA.fasta:

>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL

My desired output (shown only for speciesA.fasta):

>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL

This is my code:

for file in *.fasta; do var=$(basename $file .fasta) | sed 's/>.*/>$var/' $var.fasta >>$var.outfile.fasta; done

But all I get is:

>$var
MJSUNDKFJSKFJSKFJ
>$var
KEFJKSDJFKSDJFKSJFLSJDFLKSJF

[and so on ...]

Where did I make a mistake?

rororo

2 Answers


The bash loop is superfluous. Try:

awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta

This approach is safe even if the file names contain special or regex-active characters.

How it works

  • /^>/ {print ">" substr(FILENAME, 1, length(FILENAME)-6); next}

    For any line that begins with >, the commands in curly braces are executed. The first command prints > followed by all but the last 6 characters of the file name, which strips the 6-character suffix .fasta. The second command, next, skips the remaining awk commands for this line and starts over with the next input line.

  • 1

    This is awk's cryptic shorthand for print-the-line.
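
The two pieces described above can be tried in isolation. This is only a small sketch; the string speciesA.fasta and the two-line input are sample values chosen for illustration.

```shell
# substr(s, 1, length(s) - 6) drops the last six characters, here ".fasta".
stem=$(awk 'BEGIN { s = "speciesA.fasta"; print substr(s, 1, length(s) - 6) }')
echo "$stem"

# A bare 1 is a pattern that is always true; with no action given,
# awk falls back to its default action, which prints the line.
with_one=$(printf 'x\ny\n' | awk '1')
with_print=$(printf 'x\ny\n' | awk '{ print }')
echo "$with_one"
echo "$with_print"
```

Both of the last two commands print the input unchanged, which is why 1 can stand in for { print }.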

Example

Let's consider a directory with two (identical) test files:

$ cat speciesA.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL
$ cat speciesB.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL

The output of our command is:

$ awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL
>speciesB
MJSUNDKFJSKFJSKFJ
>speciesB
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesB
KSDAFJLASDJFKLAJFL

The output has the substitutions and concatenates all the input files.
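
To keep the concatenated result, the output can be redirected to a file. The sketch below is one possible variant, not part of the original command: it uses sub() to strip the .fasta suffix instead of counting characters, and it writes to combined.fa, a hypothetical name chosen so that a later run of the *.fasta glob does not pick the output up as input. The scratch directory and sample files are also made up for the demonstration.

```shell
# Hypothetical scratch directory and file contents, for illustration only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '>protein1 description\nMJSUNDKFJSKFJSKFJ\n' > speciesA.fasta
printf '>protein2 anothername\nKEFJKSDJFKSDJFKSJFLSJDFLKSJF\n' > speciesB.fasta

# Replace each header with ">" plus the file name minus its ".fasta" suffix,
# print every other line unchanged, and collect the result in one file.
awk '/^>/ { name = FILENAME; sub(/\.fasta$/, "", name); print ">" name; next } 1' *.fasta > combined.fa

cat combined.fa
```

Because FILENAME holds the name exactly as it was passed on the command line, running the command from inside the directory keeps path components out of the new headers.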

John1024

The shell expands variables only inside double quotes. Inside single quotes, $var is passed to sed as literal text.

for file in *.fasta
do
    # keep the leading ">" and replace the rest of the header with the file name minus its extension
    sed -i "s/^>.*/>${file%%.*}/" "$file"
done
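
The quoting difference is easy to see side by side. The sketch below also addresses a second problem in the question's loop that appears to contribute to the result: the assignment is joined to sed with a pipe, so in bash it runs in a subshell and var is never set in the shell that expands the sed script. The scratch directory, the sample file, and the output name all_species.fa are hypothetical, and unlike sed -i this version leaves the input files untouched.

```shell
# Hypothetical scratch directory and sample file, for illustration only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>protein1 description\nMJSUNDKFJSKFJSKFJ\n' > speciesA.fasta

var=speciesA

# Single quotes: the shell passes $var to sed untouched.
single=$(sed 's/^>.*/>$var/' speciesA.fasta | head -n 1)

# Double quotes: the shell substitutes the value before sed runs.
double=$(sed "s/^>.*/>$var/" speciesA.fasta | head -n 1)

echo "$single"   # >$var
echo "$double"   # >speciesA

# The loop from the question, with the assignment as its own command
# (no pipe) and the sed script in double quotes. The combined output
# goes to a name that the *.fasta glob does not match.
for file in *.fasta; do
    name=$(basename "$file" .fasta)
    sed "s/^>.*/>$name/" "$file"
done > all_species.fa
```

One caveat with this approach: a file name containing / or & would be interpreted by sed inside the replacement, so names like that need escaping or a different tool.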
P....
  • For some reason I had to modify this to work in zsh and retain the ">": `for file in *.fasta; do tag=">"${file%%.*}; sed -i "s/>.*/$tag/" "$file" ; done` – Andrés Parada Jan 14 '21 at 23:54