How can I reformat multi-line sequences in a FASTA file to a single line?

Question

Input "file.fasta" (note: this is only a sample; in the real FASTA file, a sequence may span more than three lines)

>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGtt
aGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTG
GTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaa
cCCCCCGACGACGACTCACGA

Expected output

>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGttaGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTGGTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaacCCCCCGACGACGACTCACGA

My attempt, a sed command:

sed ':a;N;$!ba;s/\([actgACGT]\)\n\([actgACGT]\)/\1\2/g' file.fasta

My (wrong) output, where each header is merged with its sequence:

>chr1:117223140-117223856 TAG:GTGGGGTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGttaGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTGGTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCTACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaacCCCCCGACGACGACTCACGA

The regular expression for header lines (lines whose first character is ">") is "^>.*$", but I do not know how to include it in the sed command.

Thanks in advance.

@tripleee you are right, it is a duplicate; I had not seen it. – Jose Ricardo Bustos M. Sep 19 '15 at 15:32

John1024 · Answer 1 · 2015-09-18T21:49:30.763

$ awk '/^>/ {print (NR>1?"\n":"")$0;; next} {printf "%s",$0;} END{print "";}' file.fasta 
>chr1:117223140-117223856 TAG:GTGGG
GTGGgggggcgCATAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGAGttaGTAGTATCGAATCGCACGACTGACAGCTCAGCATCAGCGACGACTAGTGGTGGGCGACGACAgCGATATA
>chr2:117223140-117223856 TAG:GGGCT
ACGAGCAGCAGCAGCAGCagCCGATCGACGACTCAAGTACGATACGCGaacCCCCCGACGACGACTCACGA

How it works

/^>/ {print (NR>1?"\n":"")$0;; next}

If the line starts with >, that is, if the regex /^>/ matches, print the line. If it is not the first line, that is, if NR>1, first print a newline character, which ends the sequence line of the previous record. Then skip the remaining commands and start over with the next line.

printf "%s",$0;

All other lines are printed without a trailing newline, so consecutive sequence lines are joined into one.

END{print "";}

After reaching the end of the file, print one last newline character to terminate the final sequence line.
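To try the three rules on a small input, here is a minimal sketch. The function name fasta_unwrap and the two demo records are made up for illustration, and the awk program is a spelled-out but equivalent form of the one-liner above (an explicit if instead of the ?: expression).

```shell
# fasta_unwrap is a hypothetical helper name; it reads the named files,
# or standard input when no file is given.
fasta_unwrap() {
    awk '
        # Header line: end the previous sequence (if any), then print the header.
        /^>/ { if (NR > 1) printf "\n"; print; next }
        # Sequence line: print it without a trailing newline.
        { printf "%s", $0 }
        # End of input: terminate the last sequence line.
        END { print "" }
    ' "$@"
}

# Made-up demo input: two records wrapped over several lines.
fasta_unwrap <<'EOF'
>seq1 demo
ACGT
acgt
AC
>seq2 demo
GGCC
tt
EOF
# Prints:
# >seq1 demo
# ACGTacgtAC
# >seq2 demo
# GGCCtt
```

On the file from the question this would be called as fasta_unwrap file.fasta, with the result redirected to a new file of your choice.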

score 1 · Accepted Answer · answered Sep 19 '15 at 15:13


This might work for you (GNU sed):

sed ':a;N;/>/!s/\n//;ta;P;D' file

Read two lines into the pattern space. If neither of them contains a >, delete the newline between them and repeat. If either line does contain a >, print and delete the first of them, then repeat.
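The same loop can be written one command per line, which makes each step easier to follow. This is only a sketch: the function name fasta_unwrap_sed and the demo records are made up, and it uses $!N instead of N as a defensive assumption, since a POSIX sed may discard the pattern space when N runs on the last input line (GNU sed prints it).

```shell
# fasta_unwrap_sed is a hypothetical helper name; it reads the named files,
# or standard input when no file is given.
fasta_unwrap_sed() {
    sed '
# Loop label.
:a
# Append the next input line, unless we are already on the last one.
$!N
# No ">" anywhere in the pattern space: only sequence data, so join the lines.
/>/!s/\n//
# If the join happened, loop to pull in another line.
ta
# Otherwise print up to the first newline and delete that part; if anything
# remains, the script restarts on it without reading new input.
P
D
' "$@"
}

# Made-up demo input: two records wrapped over several lines.
fasta_unwrap_sed <<'EOF'
>seq1 demo
ACGT
acgt
AC
>seq2 demo
GGCC
tt
EOF
# Prints:
# >seq1 demo
# ACGTacgtAC
# >seq2 demo
# GGCCtt
```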

answered Sep 19 '15 at 15:13

potong


I will study `sed`, I still have much to learn ..... thanks a lot – Jose Ricardo Bustos M. Sep 19 '15 at 15:23
