19

I have a data in that always comes in block of four in the following format (called FASTQ):

@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

Is there a simple sed/awk/bash way to convert them into this format (called FASTA):

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

In principle, we want to extract the first two lines in each block-of-4 and replace @ with >.

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
neversaint
  • 60,904
  • 137
  • 310
  • 477

14 Answers14

26

This is an old question, and there have been many different solutions offered. Since the accepted answer uses sed but has a glaring problem (which is that it will replace @ with > when the @ sign appears as the first letter of the quality line), I feel compelled to offer a simple sed-based solution that actually works:

sed -n '1~4s/^@/>/p;2~4p' 

The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.

The fastq_to_fasta script in the fastx toolkit also works. (It's worth mentioning that you need to specify the -Q33 option to accommodate the now common Phred+33 qual encodings. Which is funny, since it's throwing away the quality data anyway!)

Owen
  • 3,063
  • 5
  • 30
  • 26
  • Thank you! Because of this nice line of code, I finally made decision to learn sed more thoroughly. Here is a good source: grymoire.com/Unix/Sed.html#uh-15 I still have one question though: what does "~" do there please? Thanks – Helene Nov 02 '16 at 18:04
  • 4
    To translate the sed expression verbatim: "starting on line 1, and every 4th line thereafter, when you see a @ character at the beginning of a line, substitute it with a > character, and print the resulting line; then, starting at line 2, and every 4th line thereafter, just print the line". The -n option turns off automatic printing, and the 'p' in the two sed expressions selectively prints lines that match the expression. hope this helps! – Owen Nov 03 '16 at 20:08
9

sed ain't dead. If we're golfing:

sed '/^@/!d;s//>/;N'

Or, emulating http://www.ringtail.tsl.ac.uk/david-studholme/scripts/fastq2fasta.pl posted by Pierre, which only prints the first word (the id) from the first line and does (some) error handling:

#!/usr/bin/sed -f
# Read a total of four lines
$b error
N;$b error
N;$b error
N
# Parse the lines
/^@\(\([^ ]*\).*\)\(\n[ACGTN]*\)\n+\1\n.*$/{
  # Output id and sequence for FASTA format.
  s//>\2\3/
  b
}
:error
i\
Error parsing input:
q

There seem to be plenty of existing tools for converting these formats; you should probably use these instead of anything posted here (including the above).

Mark Edgar
  • 4,707
  • 2
  • 24
  • 18
  • 1
    sed is very much alive, but the sed solution offered here will likely sink your workflow. You can't rely on the @ character to uniquely indicate header lines -- quality lines can also start with @. Please see my fix below. – Owen Apr 05 '13 at 20:32
9

As detailed in Cock, et al (2009) NAR, many of these solutions are incorrect since "the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence."

See http://ukpmc.ac.uk/articlerender.cgi?accid=PMC2847217 for details.

C. Bergman
  • 131
  • 1
  • 2
  • Not true for any solution specifying that the @ character is at the begining of the line with '^@', which seems to represent the majority of the answers. Cheers – Morlock Apr 13 '11 at 15:42
  • 2
    Actually, this is a true statement since a quality value of @ can in principle occur anywhere in the quality string including the first character, '^@' is not guaranteed to catch the name lines only. – C. Bergman Aug 10 '11 at 11:08
  • Indeed. Sorry for not having taken a few more seconds to think about the problem properly. Cheers – Morlock Aug 26 '11 at 00:02
7

just awk , no need other tools

# awk '/^@SR/{gsub(/^@/,">",$1);print;getline;print}' file
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
ghostdog74
  • 327,991
  • 56
  • 259
  • 343
4

See fastq2fasta.pl in http://www.ringtail.tsl.ac.uk/david-studholme/scripts/

Pierre
  • 34,472
  • 31
  • 113
  • 192
3

I'd write

awk '
    NR%4 == 1 {print ">" substr($0, 2)}
    NR%4 == 2 {print}
' fastq > fasta
glenn jackman
  • 238,783
  • 38
  • 220
  • 352
2

You might be interested in bioawk, it is an adapted version of awk which is tuned to process fasta files

bioawk -c fastx '{ print ">"$name ORS $seq }' file.fastq

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.

kvantour
  • 25,269
  • 4
  • 47
  • 72
2

This is the fastest I've got, and I stuck it in my .bashrc file:

alias fq2fa="awk '{print \">\" substr(\$0,2);getline;print;getline;getline}'"

It doesn't fail on the infrequent but not impossible quality lines that start with @... but does fail on wrapped FASTQ, if that's even legal (it exists though).

Peter
  • 1,983
  • 1
  • 12
  • 9
1
awk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' data

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

below

awk '{gsub(/^[@]/,">"); print}' data

where data is your data file. I've received:

>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/
bua
  • 4,761
  • 1
  • 26
  • 32
1

Here's the solution to the "skip every other line" part of the problem that I just learned from SO:

while read line
do
    # print two lines
    echo "$line"
    read line_to_print
    echo "$line_to_print"

    # and skip two lines
    read line_to_skip
    read line_to_skip
done

If all that needs to be done is change one @ to >, then I reckon

while read line
do
    echo "$line" | sed 's/@/>/'
    read line
    echo "$line"

    read line_to_skip
    read line_to_skip
done

will do the job.

mob
  • 117,087
  • 18
  • 149
  • 283
  • there should be an input redirection for the input file. to replace characters in bash, ${line/@/>} should suffice. no need sed. – ghostdog74 Oct 09 '09 at 12:00
1

Something like:

awk 'BEGIN{a=0}{if(a==1){print;a=0}}/^@/{print;a=1}' myFastqFile | sed 's/^@/>/'

should work.

mouviciel
  • 66,855
  • 13
  • 106
  • 140
  • since you are using awk already, there is no need to waste an extra process calling sed. do the substitution inside awk. – ghostdog74 Oct 09 '09 at 11:58
1

I think, with gnu grep this could be done with this:

grep -A 1 "^@" t.txt | grep -v "^--" | sed -e "s/^@/\>/"
dz.
  • 1,286
  • 12
  • 12
1

I know I'm way in the future, but for the benefit of googlers:

You may want to use fastq_to_fasta from the fastx toolkit. It will keep the @ sign, though. It will also remove lines with Ns unless you tell it not to.

mmarchin
  • 108
  • 1
  • 7
0

Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. For example, for common next generation sequencing data processing tasks, use seqtk.

seqtk seq -A in.fq > out.fa

To install these tools, use conda, specifically miniconda, for example:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

References:

seqtk: https://github.com/lh3/seqtk
conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
conda create: https://docs.conda.io/projects/conda/en/latest/commands/create.html

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47