
Input "file.fasta" (note: this is only a sample; in a FASTA file, a sequence may span more than three lines)

>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGtt
aGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTG
GTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaa
cCCCCCGACGACGACTCACGA

Expected output

>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGttaGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTGGTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaacCCCCCGACGACGACTCACGA

My attempt, using sed:

sed ':a;N;$!ba;s/\([actgACGT]\)\n\([actgACGT]\)/\1\2/g' file.fasta

My incorrect output:

>chr1:117223140-117223856 TAG:GTGGGGTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGttaGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTGGTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCTACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaacCCCCCGACGACGACTCACGA

The regular expression for header lines (lines whose first character is ">") is "^>.*$", but I do not know how to include it in the sed command.

Thanks in advance.

Jose Ricardo Bustos M.

2 Answers

$ awk '/^>/ {print (NR>1?"\n":"")$0;; next} {printf "%s",$0;} END{print "";}' file.fasta 
>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGttaGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTGGTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaacCCCCCGACGACGACTCACGA

How it works

  • /^>/ {print (NR>1?"\n":"")$0;; next}

    If the line starts with >, that is, if the regex /^>/ matches, print the line. If this is not the first line (NR>1), first print a newline character, which ends the preceding sequence line. Then skip the remaining commands and start over on the next line.

  • printf "%s",$0;

    For all other lines, print them without a trailing newline, so that consecutive sequence lines are joined into one.

  • END{print "";}

    After we have reached the end of the file, print one last newline character.
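The three steps above can also be written out as a commented, multi-line awk program. This is only a sketch of the same logic: the function name unwrap_fasta and the tiny sample input are made up for illustration and are not from the question.

```shell
# Hypothetical helper name; wraps the same awk logic so it can be reused.
unwrap_fasta() {
  awk '
    /^>/ {
      # Header line: end the previous sequence line, unless this is line 1.
      if (NR > 1) printf "\n"
      # Print the header itself, followed by a newline.
      print
      next
    }
    # Sequence line: print it without a trailing newline.
    { printf "%s", $0 }
    # End of input: terminate the last sequence line.
    END { print "" }
  ' "$@"
}

# Made-up sample input (not the data from the question).
printf '>seq1\nACGT\nacgt\n>seq2\nGG\nTT\nAA\n' | unwrap_fasta
```

With a file, it can be called as unwrap_fasta file.fasta; with no arguments it reads standard input.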

John1024

This might work for you (GNU sed):

sed ':a;N;/>/!s/\n//;ta;P;D' file

Look at two lines at a time. If neither contains a >, delete the newline between them and repeat with the next line appended. If either of them does contain a >, print and delete the first of them, then repeat with what remains.
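The same script can be spread over several lines with a comment for each command, which may make the loop easier to follow. This is a sketch that assumes GNU sed (in particular, that N prints the pattern space when there is no next line); the function name join_fasta_sed and the sample input are made up for illustration.

```shell
# Hypothetical helper name; assumes GNU sed.
join_fasta_sed() {
  sed '
    # Loop start.
    :a
    # Append the next input line to the pattern space.
    N
    # If the pattern space has no ">", remove the embedded newline.
    />/!s/\n//
    # If a newline was removed, jump back to the label and read another line.
    ta
    # Otherwise print the pattern space up to the first newline,
    P
    # then delete that part and restart the script with what remains.
    D
  ' "$@"
}

# Made-up sample input (not the data from the question).
printf '>seq1\nACGT\nacgt\n>seq2\nGG\nTT\nAA\n' | join_fasta_sed
```

The comments sit on their own lines rather than after the commands, so that the label and branch commands are not followed by any trailing text.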

potong