Converting FASTQ to FASTA with SED/AWK

Question

I have a data in that always comes in block of four in the following format (called FASTQ):

@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

Is there a simple sed/awk/bash way to convert them into this format (called FASTA):

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

In principle, we want to extract the first two lines in each block-of-4 and replace @ with >.

Owen · Accepted Answer · 2013-04-05T20:29:04.677

26

This is an old question, and there have been many different solutions offered. Since the accepted answer uses sed but has a glaring problem (which is that it will replace @ with > when the @ sign appears as the first letter of the quality line), I feel compelled to offer a simple sed-based solution that actually works:

sed -n '1~4s/^@/>/p;2~4p'

The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.

The fastq_to_fasta script in the fastx toolkit also works. (It's worth mentioning that you need to specify the -Q33 option to accommodate the now common Phred+33 qual encodings. Which is funny, since it's throwing away the quality data anyway!)

edited Apr 05 '13 at 20:29

answered Apr 28 '12 at 00:19

Owen

3,063
5
30
26

Thank you! Because of this nice line of code, I finally made decision to learn sed more thoroughly. Here is a good source: grymoire.com/Unix/Sed.html#uh-15 I still have one question though: what does "~" do there please? Thanks – Helene Nov 02 '16 at 18:04
4

To translate the sed expression verbatim: "starting on line 1, and every 4th line thereafter, when you see a @ character at the beginning of a line, substitute it with a > character, and print the resulting line; then, starting at line 2, and every 4th line thereafter, just print the line". The -n option turns off automatic printing, and the 'p' in the two sed expressions selectively prints lines that match the expression. hope this helps! – Owen Nov 03 '16 at 20:08

score 9 · Answer 2 · answered Oct 09 '09 at 14:45

sed ain't dead. If we're golfing:

sed '/^@/!d;s//>/;N'

Or, emulating http://www.ringtail.tsl.ac.uk/david-studholme/scripts/fastq2fasta.pl posted by Pierre, which only prints the first word (the id) from the first line and does (some) error handling:

#!/usr/bin/sed -f
# Read a total of four lines
$b error
N;$b error
N;$b error
N
# Parse the lines
/^@\(\([^ ]*\).*\)\(\n[ACGTN]*\)\n+\1\n.*$/{
  # Output id and sequence for FASTA format.
  s//>\2\3/
  b
}
:error
i\
Error parsing input:
q

There seem to be plenty of existing tools for converting these formats; you should probably use these instead of anything posted here (including the above).

sed is very much alive, but the sed solution offered here will likely sink your workflow. You can't rely on the @ character to uniquely indicate header lines -- quality lines can also start with @. Please see my fix below. — Owen, Apr 05 '13 at 20:32

score 9 · Answer 3 · answered May 09 '10 at 20:06

9

As detailed in Cock, et al (2009) NAR, many of these solutions are incorrect since "the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence."

See http://ukpmc.ac.uk/articlerender.cgi?accid=PMC2847217 for details.

answered May 09 '10 at 20:06

C. Bergman

131
1
2

Not true for any solution specifying that the @ character is at the begining of the line with '^@', which seems to represent the majority of the answers. Cheers – Morlock Apr 13 '11 at 15:42
2

Actually, this is a true statement since a quality value of @ can in principle occur anywhere in the quality string including the first character, '^@' is not guaranteed to catch the name lines only. – C. Bergman Aug 10 '11 at 11:08
Indeed. Sorry for not having taken a few more seconds to think about the problem properly. Cheers – Morlock Aug 26 '11 at 00:02

score 7 · Answer 4 · answered Oct 09 '09 at 11:57

7

just awk , no need other tools

# awk '/^@SR/{gsub(/^@/,">",$1);print;getline;print}' file
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

answered Oct 09 '09 at 11:57

ghostdog74

327,991
56
259
343

score 4 · Answer 5 · answered Oct 09 '09 at 07:45

4

See fastq2fasta.pl in http://www.ringtail.tsl.ac.uk/david-studholme/scripts/

answered Oct 09 '09 at 07:45

Pierre

34,472
31
113
192

score 3 · Answer 6 · answered Jun 30 '11 at 22:33

3

I'd write

awk '
    NR%4 == 1 {print ">" substr($0, 2)}
    NR%4 == 2 {print}
' fastq > fasta

answered Jun 30 '11 at 22:33

glenn jackman

238,783
38
220
352

score 2 · Answer 7 · answered Dec 12 '18 at 11:27

You might be interested in bioawk, it is an adapted version of awk which is tuned to process fasta files

bioawk -c fastx '{ print ">"$name ORS $seq }' file.fastq

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.

Peter · Answer 8 · 2011-11-11T19:27:34.273

2

This is the fastest I've got, and I stuck it in my .bashrc file:

alias fq2fa="awk '{print \">\" substr(\$0,2);getline;print;getline;getline}'"

It doesn't fail on the infrequent but not impossible quality lines that start with @... but does fail on wrapped FASTQ, if that's even legal (it exists though).

edited Nov 11 '11 at 19:27

answered Oct 27 '11 at 23:06

Peter

1,983
1
12
9

bua · Answer 9 · 2009-10-09T08:13:06.920

awk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' data

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

below

awk '{gsub(/^[@]/,">"); print}' data

where data is your data file. I've received:

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

score 1 · Answer 10 · answered Oct 09 '09 at 07:39

Here's the solution to the "skip every other line" part of the problem that I just learned from SO:

while read line
do
    # print two lines
    echo "$line"
    read line_to_print
    echo "$line_to_print"

    # and skip two lines
    read line_to_skip
    read line_to_skip
done

If all that needs to be done is change one @ to >, then I reckon

while read line
do
    echo "$line" | sed 's/@/>/'
    read line
    echo "$line"

    read line_to_skip
    read line_to_skip
done

will do the job.

there should be an input redirection for the input file. to replace characters in bash, ${line/@/>} should suffice. no need sed. — ghostdog74, Oct 09 '09 at 12:00

score 1 · Answer 11 · answered Oct 09 '09 at 07:39

1

Something like:

awk 'BEGIN{a=0}{if(a==1){print;a=0}}/^@/{print;a=1}' myFastqFile | sed 's/^@/>/'

should work.

answered Oct 09 '09 at 07:39

mouviciel

66,855
13
106
140

since you are using awk already, there is no need to waste an extra process calling sed. do the substitution inside awk. – ghostdog74 Oct 09 '09 at 11:58

score 1 · Answer 12 · answered Oct 09 '09 at 07:40

1

I think, with gnu grep this could be done with this:

grep -A 1 "^@" t.txt | grep -v "^--" | sed -e "s/^@/\>/"

answered Oct 09 '09 at 07:40

dz.

1,286
12
12

if the file happens to be very big, there's no point chaining greps and sed together like that. – ghostdog74 Oct 09 '09 at 12:02

score 1 · Answer 13 · answered Jun 30 '11 at 19:21

1

I know I'm way in the future, but for the benefit of googlers:

You may want to use fastq_to_fasta from the fastx toolkit. It will keep the @ sign, though. It will also remove lines with Ns unless you tell it not to.

answered Jun 30 '11 at 19:21

mmarchin

108
1
7

score 0 · Answer 14 · answered Mar 22 '23 at 16:46

Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. For example, for common next generation sequencing data processing tasks, use seqtk.

seqtk seq -A in.fq > out.fa

To install these tools, use conda, specifically miniconda, for example:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

References:

seqtk: https://github.com/lh3/seqtk
conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
conda create: https://docs.conda.io/projects/conda/en/latest/commands/create.html

Converting FASTQ to FASTA with SED/AWK

14 Answers14

Linked

Related