1

I'm trying to find a less time consuming way of splitting fastq files by sequence length, i.e. splitting one big fastq file into multiple ones containing only sequences of the same length. Input is a normal fastq file (4 lines per sequence, with the actual sequence in the second line in every quartet) with varying sequence lengths:

@HISEQ:28:H8P69ADXX:1:1101:1462:2036 1:N:0:CTTGTA
NCCATAAAGTAGAAAGCACT
+
#00<FFFFFFFFFIIFIIFF
@HISEQ:28:H8P69ADXX:1:1101:1419:2156 1:N:0:CTTGTA
TGGAGAGAAAGGCAGTTCCTGA
+
BBBFFFFFFFFFFIIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1378:2223 1:N:0:CTTGTA
TCCTGTACTGAGCTGCCCCGA
+
BBBFFFFFFFFFFIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1585:2081 1:N:0:CTTGTA
AAACCGTTACCATTACTGAGT
+
BBBFFFFFFFFFFIIIIFIII

Right now I'm using awk to filter out sequences of a specific length or within a specific range:

awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) == 22) {print header, seq, qheader, qseq}}'

If I want to have an output file for every single sequence length, I manage with a for loop:

for i in {16..33};
awk -v var=$i 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) == var) {print header, seq, qheader, qseq}}'
done

The problem is, that although it works fine, it is rather time consuming, because I'm checking the whole file for each length separately I guess. Additionally I need to check for the longest and the shortest sequence beforehand.

Can anyone help me finding a more efficient solution than my loop? If possible a solution where I do not have to specify a range but one that checks for the minimal and maximum length and splits them automatically. I would like to do it in awk but I'm open for everything. Thanks Benedikt

Skoddo
  • 25
  • 7
  • Reminds me of this: https://stackoverflow.com/questions/3194349/how-do-i-split-a-file-into-n-no-of-parts – Koen De Groote Apr 11 '18 at 14:21
  • Unfortunately my problem is a little bit different. I want to split my fastq file according to certain criteria (length of sequence) and not only into a certain amount of files. If you split by length you will end up with files that will have vastly varying sizes since some sequence lengths are rare while others are very abundant. But thanks for taking the time! – Skoddo Apr 11 '18 at 14:36

1 Answers1

4

something like this?

$ awk        '{rec=rec sep $0; sep=ORS} 
       !(NR%4){print rec > fn; rec=sep=""} 
       NR%4==2{fn = length($0)".seq"}' file

will generate these 3 files with contents

==> 20.seq <==
@HISEQ:28:H8P69ADXX:1:1101:1462:2036 1:N:0:CTTGTA
NCCATAAAGTAGAAAGCACT
+
#00<FFFFFFFFFIIFIIFF

==> 21.seq <==
@HISEQ:28:H8P69ADXX:1:1101:1378:2223 1:N:0:CTTGTA
TCCTGTACTGAGCTGCCCCGA
+
BBBFFFFFFFFFFIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1585:2081 1:N:0:CTTGTA
AAACCGTTACCATTACTGAGT
+
BBBFFFFFFFFFFIIIIFIII

==> 22.seq <==
@HISEQ:28:H8P69ADXX:1:1101:1419:2156 1:N:0:CTTGTA
TGGAGAGAAAGGCAGTTCCTGA
+
BBBFFFFFFFFFFIIIIIIIII

since there will be a handful of these output files, there is no need to explicitly close them.

Explanation

{rec=rec sep $0; sep=ORS} build the record line by line with ORS in between lines, with lazy initialization of the separator we can eliminate the dangling first separator.

!(NR%4) if the line number is a multiple of 4

{print rec > fn; rec=sep=""} print the record to the file and reset record and separator

NR%4==2 if the line number is a 2 of 4.

{fn = length($0)".seq"} set the filename

karakfa
  • 66,216
  • 7
  • 41
  • 56
  • Thanks a lot and sorry that it took me a while before I could answer. Yor code does exactly what I'm looking for and in the less than half the time compared to my for loop. real 5m3.676s user 3m25.148s sys 0m8.816s for the for loop and real 2m2.489s user 0m32.216s sys 0m1.876s for yours. – Skoddo Apr 12 '18 at 10:47
  • On a side note: Could you explain your code a little? I'm trying to understand awk better. I can see what I have to change to get different output file names but apart from that, I'm pretty lost. – Skoddo Apr 12 '18 at 10:53
  • Once again thanks for the explanations I think I understand now what you did there. – Skoddo Apr 12 '18 at 15:35
  • No problem, pay it forward! – karakfa Apr 12 '18 at 16:01
  • I was trying to adapt your script from fastq files to fasta files (basically the same file type but with only 2 lines per sequence instead of 4). I changed every occurence of NR%4 to NR%2 thinking that this would be an easy fix but I got the following error: fatal: expression for `>' redirection has null string value. Everything else is the same. Any help is very much appreciated. – Skoddo Apr 18 '18 at 16:51
  • the line where the filename is set is not working, you can't expect `NR%2==2` to be true. change to `NR%2==0` which is the same as `!(NR%2)` so you should combine the two statements with the same condition, obviously setting the filename before you print out. – karakfa Apr 18 '18 at 17:15