I have data in FASTQ format:

@HISEQ:157:C11RCACXX:6:1101:1522:2491 2:N:0:CGTACG
GTGCCNNNNNNNNNNNNNNNNNNNNNNNTGCGNNNNNNNNNNNNNNCNNGCAGATACTCGTANNNNNNNNNGNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
+
@BCFF###########################################################################
#####################
@HISEQ:157:C11RCACXX:6:1101:1668:2494 2:N:0:CGTACG
TCTTTNNNNNNNNNNNNNNNNNNNNNNNATTGNNNNNNNNNNNNNNTTNTGTTTTACGGTTTNNNNNNNNGCNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
+
C@CFF###########################################################################
#####################
@HISEQ:157:C11RCACXX:6:1101:2557:2492 2:N:0:CGTACG
CCTCTNNNNNNNNNNNNNNNNNNNNNNNGTTGNNNNNNNNNNNNNNCNNCAACACACTCCTCNNNNNNNNGCNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
+
CCCFF###########################################################################
#####################

I tried to split each read at the "+" line using awk, but it didn't work. Is there a simple sed/awk command that can convert this into FASTA format?

The expected output is:

>HISEQ:157:C11RCACXX:6:1101:1522:2491 2:N:0:
CGTACGGTGCCNNNNNNNNNNNNNNNNNNNNNNNTGCGNNNNNNNNNNNNNNCNNGCAGATACTCGTANNNNNNNNNGNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>HISEQ:157:C11RCACXX:6:1101:1668:2494 2:N:0:
CGTACGTCTTTNNNNNNNNNNNNNNNNNNNNNNNATTGNNNNNNNNNNNNNNTTNTGTTTTACGGTTTNNNNNNNNGCNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>HISEQ:157:C11RCACXX:6:1101:2557:2492 2:N:0:
CGTACGCCTCTNNNNNNNNNNNNNNNNNNNNNNNGTTGNNNNNNNNNNNNNNCNNCAACACACTCCTCNNNNNNNNGCNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN

Thanks a lot!

Leandro Papasidero
Yevis
  • In addition, I would like the last six characters of the header line (the index, e.g. CGTACG) to be merged onto the start of the read sequence. Thanks! – Yevis Oct 23 '13 at 00:38
  • Do you have to use awk? If not, try googling: fastq to fasta. – Tyler Oct 23 '13 at 01:18
  • If you expect your readers to know what `fast(a) format` "looks like", you're greatly reducing the number of people that can help you. Consider editing your question to include the required output given your sample data. Also recall that `RS="+";ORS="\n"` will tell `awk` to split records at each '+' char, and write out revised records with just a newline char. Good luck. – shellter Oct 23 '13 at 01:30

2 Answers

You can try the following:

awk -f conv.awk input.txt

where input.txt is your input data file, and conv.awk contains:

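# Header line: switch to the FASTA ">" marker and drop the trailing index after the last ":"
/^@HISEQ/ { p=1; sub(/^@/,">"); sub(/:[^:]*$/,":"); print; next }
# "+" ends the sequence block, so the quality lines that follow are skipped
/^\+/ { p=0 }
# Print sequence lines while inside a record
p==1 { print }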
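
The script above removes the index from the header but does not carry it over to the sequence, which the expected output in the question asks for. A possible variant is sketched below. It assumes that every header starts with `@HISEQ` as in the sample data and that the index is whatever follows the last ":" on the header line. The function name `fastq_to_fasta` and the variable names `idx`, `in_seq` and `first` are made up for illustration.

```shell
# Hypothetical helper: reads FASTQ from the given file(s), or from stdin if none.
fastq_to_fasta() {
  awk '
    # Header line (assumed to start with @HISEQ, as in the sample data)
    /^@HISEQ/ {
      idx = $0
      sub(/.*:/, "", idx)     # keep only the text after the last ":" (the index)
      sub(/^@/, ">")          # FASTA headers start with ">"
      sub(/:[^:]*$/, ":")     # drop the index from the header itself
      print
      in_seq = 1; first = 1
      next
    }
    # The "+" line ends the sequence block; quality lines after it are skipped
    /^\+/ { in_seq = 0 }
    # Sequence lines: prepend the index to the first one only
    in_seq {
      if (first) { $0 = idx $0; first = 0 }
      print
    }
  ' "$@"
}
```

It could then be called as `fastq_to_fasta input.txt > output.fa`, where both file names are placeholders.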
Håkon Hægland
awk '((/@/&&$0!~/#/)||$0!~/#/)&&$0!~/\+/' your_file

Tested below:

> cat temp2
@HISEQ:157:C11RCACXX:6:1101:1522:2491 2:N:0:CGTACG
GTGCCNNNNNNNNNNNNNNNNNNNNNNNTGCGNNNNNNNNNNNNNNCNNGCAGATACTCGTANNNNNNNNNGNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
+
@BCFF###########################################################################
#####################
@HISEQ:157:C11RCACXX:6:1101:1668:2494 2:N:0:CGTACG
TCTTTNNNNNNNNNNNNNNNNNNNNNNNATTGNNNNNNNNNNNNNNTTNTGTTTTACGGTTTNNNNNNNNGCNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
+
C@CFF###########################################################################
#####################
@HISEQ:157:C11RCACXX:6:1101:2557:2492 2:N:0:CGTACG
CCTCTNNNNNNNNNNNNNNNNNNNNNNNGTTGNNNNNNNNNNNNNNCNNCAACACACTCCTCNNNNNNNNGCNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
+
CCCFF###########################################################################
#####################
>
> nawk '((/@/&&$0!~/#/)||$0!~/#/)&&$0!~/\+/' temp2
@HISEQ:157:C11RCACXX:6:1101:1522:2491 2:N:0:CGTACG
GTGCCNNNNNNNNNNNNNNNNNNNNNNNTGCGNNNNNNNNNNNNNNCNNGCAGATACTCGTANNNNNNNNNGNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
@HISEQ:157:C11RCACXX:6:1101:1668:2494 2:N:0:CGTACG
TCTTTNNNNNNNNNNNNNNNNNNNNNNNATTGNNNNNNNNNNNNNNTTNTGTTTTACGGTTTNNNNNNNNGCNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
@HISEQ:157:C11RCACXX:6:1101:2557:2492 2:N:0:CGTACG
CCTCTNNNNNNNNNNNNNNNNNNNNNNNGTTGNNNNNNNNNNNNNNCNNCAACACACTCCTCNNNNNNNNGCNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>
Vijay
  • Thank you, it's nice code, but I expected the last 6 characters of the header line (the index) to be merged onto the start of the read sequence. – Yevis Oct 23 '13 at 16:28