I am trying to append a file name at the end of certain lines in many files which I am concatenating.

short example:

INPUTS:

filename (1): 1234_contigs.fasta
>NODE_STUFF
GATTACA

filename (2): 5678_contigs.fasta
>NODE_TUFF
TGTAATC

OUTPUT:

>NODE_STUFF-1234
GATTACA
>NODE_TUFF-5678
TGTAATC

The code that I am using as a scaffold for this was commandeered from another post and my most successful iterations upon it are:

for i in ./*/*contigs.fasta; do sed '/^>NODE.*/ s/$/-(basename $i _contigs.fasta)/' /g $i; done

>NODE_STUFF-(basename $i _contigs.fasta)
GATTACA
>NODE_TUFF-(basename $i _contigs.fasta)
TGTAATC


for i in ./*/*contigs.fasta; do sed s/'^>NODE.*'$/$(basename $i _contigs.fasta)\ /g $i; done
1234 
GATTACA
5678 
TGTAATC

While I see many similar questions, I am unable to find a way to do this for only certain lines in these files (which are functionally equivalent to .txt for this example). I believe my confused results come from errors in handling literals, but after several dozen poorly recorded attempts at pushing quotation marks around I feel more lost than found. Note that each file can contain many lines starting with >NODE, and I wish to append the filename to every one of them.

Timur Shtatland

3 Answers

With your shown samples, please try the following awk code. There is no need for a for loop to traverse the files; awk can read all of them by itself. In short: for each line starting with >, print the line followed by - and the part of the current file name before the first _; print every other line unchanged.

awk '/^>/{file=FILENAME;sub(/_.*/,"",file);print $0"-"file;next} 1' *.fasta

Or, more concisely:

awk '/^>/{file=FILENAME;sub(/_.*/,"",file);$0=$0"-"file} 1' *.fasta
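
When the files are reached through a path such as ./*/*_contigs.fasta (as in the question), FILENAME also contains the directory part, so the `gsub` variant suggested in the comments may be the safer choice. A minimal sketch, assuming a scratch directory and the made-up directory names dirA and dirB, which are for illustration only:

```shell
# Hypothetical scratch layout mirroring the question: ./<dir>/<number>_contigs.fasta
tmp=$(mktemp -d)
mkdir -p "$tmp/dirA" "$tmp/dirB"
printf '>NODE_STUFF\nGATTACA\n' > "$tmp/dirA/1234_contigs.fasta"
printf '>NODE_TUFF\nTGTAATC\n'  > "$tmp/dirB/5678_contigs.fasta"

# gsub removes everything up to the last "/" and everything from the "_"
# onwards, so only the numeric prefix of the file name remains.
result=$(cd "$tmp" && awk '/^>/{file=FILENAME; gsub(/^.*\/|_.*$/,"",file); $0=$0"-"file} 1' ./*/*_contigs.fasta)
printf '%s\n' "$result"

rm -rf "$tmp"   # clean up the scratch directory
```

On the two sample files this should print the four lines shown under OUTPUT in the question.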
RavinderSingh13
  • I think OP wants to append the prefixing number of the filename; +1 – Fravadona May 25 '22 at 18:46
  • @Fravadona, ohh thank you for nice catch, I think I am in sleep a bit :) cheers and thank you. – RavinderSingh13 May 25 '22 at 18:48
  • for any other new folks interested in discussion on when to use either of these please view these links: [when should I use sed and when should I use awk?](https://stackoverflow.com/questions/14229377/when-should-i-use-sed-and-when-should-i-use-awk), [what are the differences between perl python awk and sed](https://stackoverflow.com/questions/366980/what-are-the-differences-between-perl-python-awk-and-sed) – statlerNwaldorf May 25 '22 at 19:13
  • @RavinderSingh13 for getting the prefixing numbers: `gsub(/^.*\/|_.*$/,"",file)` instead of `sub` – Fravadona May 25 '22 at 19:49
  • @Fravadona, sure for complete path that make sense, thank you I will edit it on morning or if you want to edit please feel free too, it's too late night here cheers – RavinderSingh13 May 25 '22 at 19:59
  • @statlerNwaldorf For the current use-case the difference is that `sed` isn't capable of doing the job with a single fork. Also, expanding shell variables inside a `sed` statement without escaping them might lead to unwanted behaviors. – Fravadona May 25 '22 at 19:59
  • fravadona, not disagreeing with you. I thought the first link in particular did a good job of elucidating why awk was a better solution for this, and it helped explain Ravinders approach. I accepted leu answer as my question tags had not been edited yet and it worked off the shelf. – statlerNwaldorf May 25 '22 at 22:04

With bash and sed I'd propose:

for i in ./*/*contigs.fasta; do
   n=$(basename -s _contigs.fasta "$i")
   sed "s/^\(>NODE.*\)/\1-$n/" "$i"
done
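
The reason the quoting matters can be seen in isolation. Inside single quotes the shell hands the text to sed untouched, which is why the first attempt in the question printed `(basename $i _contigs.fasta)` literally (it also lacked the `$` of `$(...)`); inside double quotes the variable is expanded before sed runs. A small sketch, assuming a made-up path in `i`:

```shell
i=./dir/1234_contigs.fasta          # hypothetical path, for illustration only
n=$(basename "$i" _contigs.fasta)   # -> 1234

# Single quotes: sed receives the two characters "$n" as plain text.
single=$(printf '>NODE_STUFF\nGATTACA\n' | sed 's/^>NODE.*/&-$n/')

# Double quotes: the shell substitutes the value of n first.
double=$(printf '>NODE_STUFF\nGATTACA\n' | sed "s/^>NODE.*/&-$n/")

printf '%s\n' "$single" "$double"
```

The first header comes out as `>NODE_STUFF-$n`, the second as `>NODE_STUFF-1234`. Note that this only stays safe while the substituted value contains nothing special to sed, such as `/` or `&`.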
leu

Try

for file in */*_contigs.fasta; do
    filenum=${file%_contigs.fasta}
    filenum=${filenum##*/}

    sed -- "s/^>NODE.*\$/&-${filenum}/" "$file"
done
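
The two parameter expansions do the job of `basename` without starting an extra process. As a quick illustration, assuming a made-up path:

```shell
file=some/dir/5678_contigs.fasta   # hypothetical path, for illustration only

filenum=${file%_contigs.fasta}     # strip the suffix       -> some/dir/5678
filenum=${filenum##*/}             # strip longest "*/" prefix -> 5678

printf '%s\n' "$filenum"
```

This prints 5678, which is the value that ends up after the `-` in the sed replacement.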
pjh