Renaming FASTA headers with awk - regular expressions

Question

I have consensus sequences of a three-segmented virus genome (the segments are named L, M and S), so each genome FASTA file contains three FASTA records that look like this:

>Toscana_virus_L_(consensus)_(consensus)
TTAACCATTCATCCCCTGAGGAGGTATGAATCATCAATTTATGACACTCCAATACCAGCC
..

>Toscana_virus_M_(consensus)_(consensus)
AATATACTATTATTTCAGAGATAGGGAACGGCACTAGAACTTCCTTTTTAGAAGCTTGGG
..

>Toscana_virus_S_(consensus)_(consensus)
NNACAAAGACCTCCCGTATTGCTAAACCAGAACTAATAATAGACTTCTAGACAGCCATGC
..

I want to replace the header of each record with the proper sample name.

The sample names in my file names look like this:

LCR_1152;
LCR367, etc.

So this is what I did:

cp *.fasta to_rename/
mkdir renamed
cd to_rename
for filename in *.fasta; do
    filename2=$(echo "$filename" | sed 's/.*\(LCR_?\).*\([0-9][0-9][0-9][0-9]?$\).*/\1\2/')
    awk -v a="$filename2" '/^>/{print ">"a; next}{print}' < "$filename" > ../renamed/"$filename"
done

It worked well, but the problem is that now, inside each file, the three segments have the same header; I lost the distinction between L, M and S.

For example this is what I get:

>LCR_1152
TTAACCATTCATCCCCTGAGGAGGTATGAATCATCAATTTATGACACTCCAATACCAGCC
..

>LCR_1152
AATATACTATTATTTCAGAGATAGGGAACGGCACTAGAACTTCCTTTTTAGAAGCTTGGG
..

>LCR_1152
NNACAAAGACCTCCCGTATTGCTAAACCAGAACTAATAATAGACTTCTAGACAGCCATGC
..

But what I want is the following:

>LCR_1152_L
TTAACCATTCATCCCCTGAGGAGGTATGAATCATCAATTTATGACACTCCAATACCAGCC
..

>LCR_1152_M
AATATACTATTATTTCAGAGATAGGGAACGGCACTAGAACTTCCTTTTTAGAAGCTTGGG
..

>LCR_1152_S
NNACAAAGACCTCCCGTATTGCTAAACCAGAACTAATAATAGACTTCTAGACAGCCATGC
..

That way I would not lose the identity of the fragments.

I don't know how to solve it; my attempts have been unsuccessful :(

Does anyone know how to work that out?

What is the significance of the second and subsequent lines in the text file with the new headings? – tripleee, Jul 10 '20 at 15:32
Do you mean this pattern? **TTAACCATTCATCCCCTGAGGAGGTATGAATCATCAATTTATGACACTCCAATACCAGCC** It is the genome sequence! – Fabiana, Jul 10 '20 at 16:44
No, I mean `LCR367, etc`, what should `LCR3672` be used for and what happens at `etc`? – tripleee, Jul 10 '20 at 18:37

mchelabi · Answer 1 · 2020-07-11T12:18:37.240

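If I understand correctly, this is what you want to do (the file name is used as the new header, followed by the segment suffix):

for file in fic1 fic2 ...
do
    awk -v f="$file" '
    # Header line: work out the segment from the old header,
    # then print the file name plus the segment suffix.
    /^>/ {
        suffix = ""
        if ($0 ~ /_L_/) { suffix = "_L" }
        if ($0 ~ /_M_/) { suffix = "_M" }
        if ($0 ~ /_S_/) { suffix = "_S" }
        print ">" f suffix
        next
    }
    # Sequence line: print unchanged.
    { print }' "$file" > /renamedpath/"$file"
done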

result:

>LCR_1152_L
TTAACCATTCATCCCCTGAGGAGGTATGAATCATCAATTTATGACACTCCAATACCAGCC
..

>LCR_1152_M
AATATACTATTATTTCAGAGATAGGGAACGGCACTAGAACTTCCTTTTTAGAAGCTTGGG
..

>LCR_1152_S
NNACAAAGACCTCCCGTATTGCTAAACCAGAACTAATAATAGACTTCTAGACAGCCATGC
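
A variation on the loop above, as a minimal sketch only. It assumes each input file is named after its sample with a `.fasta` extension (for example `LCR_1152.fasta`, a hypothetical name) and that every old header contains `_L_`, `_M_` or `_S_`; the first such match in the header is used. The function name `rename_headers` and its output-directory argument are illustrative and not part of the original answer.

```shell
#!/bin/sh
# rename_headers FILE OUTDIR   (hypothetical helper, not from the answer)
# Rewrites each FASTA header in FILE to ">SAMPLE_SEGMENT", where SAMPLE is
# the file name without its .fasta extension and SEGMENT is L, M or S.
rename_headers() {
    file=$1
    outdir=$2
    # Assumption: the sample name is the file name minus ".fasta".
    sample=$(basename "$file" .fasta)
    awk -v s="$sample" '
        /^>/ {
            # Take the segment letter from the old header, if present.
            if (match($0, /_[LMS]_/)) {
                print ">" s "_" substr($0, RSTART + 1, 1)
            } else {
                print ">" s          # no segment found: sample name only
            }
            next
        }
        { print }                    # sequence lines pass through unchanged
    ' "$file" > "$outdir/$(basename "$file")"
}
```

It could be called as `for f in *.fasta; do rename_headers "$f" renamed; done` after `mkdir -p renamed`. Stripping the extension with `basename` keeps `.fasta` out of the new header, which passing the raw file name to awk would not do.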

edited Jul 11 '20 at 12:18

answered Jul 10 '20 at 19:00


double quotes are not necessary on the 'file' variable for this example, and the $files is just an example. – mchelabi Jul 11 '20 at 08:46
Omitting double quotes when they are redundant is harmless, but we don't know what the OP's actual file names look like. Missing quotes trip up beginners all the time, and are hard to debug even for more advanced users. Better safe than sorry. See also [When to wrap quotes around a shell variable?](https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable) – tripleee Jul 11 '20 at 10:50
You had a syntax error in the `files` assignment (commas where the shell requires just spaces), and again, this generalizes poorly to arbitrary file names. You could put the file names in an array variable, but a variable actually adds no value at all here, and consumes (granted, a very tiny amount of) memory for no reason. – tripleee Jul 11 '20 at 10:52
Reading the question again, I guess this will solve the OP's problem, if you edit it to fix the missing `>`. Maybe mention how you ignored the (too complex) renaming scheme which wasn't part of the actual question anyway. – tripleee Jul 11 '20 at 10:56
There were no syntax errors in the file names. The '$files' variable was just an example. The user is free to choose how to recover the files. I know very well when to put the double quotes (I am a shell specialist, awk for more than 13 years, so I am not a beginner) ;) – mchelabi Jul 11 '20 at 11:04
I'm not saying you are a beginner, I'm saying the code will be problematic for a beginner who tries to adapt it to, say, file names with spaces in them. – tripleee Jul 11 '20 at 11:20
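
To make the quoting point in these comments concrete, here is a small sketch for POSIX sh or bash with the default `IFS`. The file name is hypothetical and chosen to contain a space, and `count_args` is an illustrative helper that only reports how many arguments it received.

```shell
#!/bin/sh
# count_args prints how many arguments it received (illustrative helper).
count_args() {
    echo "$#"
}

file='LCR 1152.fasta'        # hypothetical file name containing a space

count_args $file             # unquoted: the shell splits it into 2 words
count_args "$file"           # quoted: passed on as a single argument
```

With the unquoted form, awk or a redirection would see `LCR` and `1152.fasta` as two separate names, which is the failure mode the comments warn about.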
