Input "file.fasta" (note: this is only a sample; in a real FASTA file, a sequence may span more than three lines):
>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGtt
aGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTG
GTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaa
cCCCCCGACGACGACTCACGA
Expected output
>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGttaGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTGGTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaacCCCCCGACGACGACTCACGA
My attempt, using this sed command:
sed ':a;N;$!ba;s/\([actgACGT]\)\n\([actgACGT]\)/\1\2/g' file.fasta
My wrong output (each header is merged with the first line of its sequence):
>chr1:117223140-117223856 TAG:GTGGGGTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGttaGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTGGTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCTACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaacCCCCCGACGACGACTCACGA
The regular expression for a header (a line whose first character is ">") is "^>.*$", but I do not know how to use it in the sed command so that header lines are left alone.
Thanks in advance.
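For reference, here is a minimal sketch of one way to get the expected output. It uses awk rather than sed, so it is a swapped-in technique and not a fix of the command above. It assumes every header starts with ">" in column 1 and that lines have no trailing whitespace or carriage returns. The output name joined.fasta is only a placeholder.

```shell
# Recreate the sample input from the question.
cat > file.fasta <<'EOF'
>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGtt
aGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTG
GTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaa
cCCCCCGACGACGACTCACGA
EOF

# Header lines (starting with ">") are printed unchanged.
# All other lines are appended to "seq", which is flushed as a single
# line when the next header or the end of the file is reached.
# joined.fasta is a hypothetical output name.
awk '
  /^>/ { if (seq != "") print seq; seq = ""; print; next }
  { seq = seq $0 }
  END { if (seq != "") print seq }
' file.fasta > joined.fasta

cat joined.fasta
```

Because the header test is done per line before anything is joined, the last letter of the header never gets a chance to match the nucleotide character class, which is what causes the merge in the sed attempt.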