1

I have a fastq file (file.fastq) of about 80GB in size which has a header line and three subsequent information lines. I need to match /1 and /2 in header lines matching the header information and put them in one file sorted by /1 and /2 (interleaved).
For example, I want to find @HS2000-1015_160:5:2114:14081:69256/2 for @HS2000-1015_160:5:2114:14081:69256/1 and if found, I want to stack them together (repeating this for other headers as well) and get the result as shown below.

I was thinking something like this, but couldn't figure out how to do this cat file.fastq | grep -A3 --no-group-separator ...

file.fastq

@HS2000-1015_160:5:1114:14809:86636/2
GTTTGGAAAACAGAAGGGATTCGTCTGGAAAGGGCGGGAAACAAAGGACTGGAAGGGAGTGATGGAAAAAAACATGATGAAACTGATCTGGAAAGAGCCA
+
@B@FFFFFHGHHHIIJJJHIJJJJIIJJJIJJJIJJIHFDDEDDBDDDDDBDDDDDDDD2?CDDDDDCCDDDDDDDDCCD@CDDDCDDEDDCDDDDC9<A
@HS2000-1015_160:5:2114:14081:69256/1
CACCCTGATCCAAGGTGGACCTCATATGCTATCCACCAATTCTTTCATTAATCTTTATCTGAGTTCCTATTACATATTTCAAAGAAACTGTCAAACAACT
+
CCCFFFFFHHHHHJJCGIIHIIJJJIJJJIIJJJJIJJ;FHIJIJJIGIGDIIJJJJJIJJIIFGIJIEIJJIJJIIIJJJHHFGHFFFFDFDEDEEDDD
@HS2000-1015_160:7:1108:13370:100570/1
GTGGGTCTTTGATCTTACAGGACAAATCCTTCAGAAAGATCAGGAAATCAAGGACCTAAAACAGAAGATAGCCGAAGTCATGGCCGTCATGCCCAGCATA
+
CCCFFFFFHHHFHGGIJIJDHEIIIGEHGGII>CFII<BG<DGCADDHBD>>BC?(8CGCCHIEC;@9@D47?BAD4,.(.6.;(3=8<A>@:<@AB9AA
@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCCAGAGACGCTCCTTTGCAATGCCTTCCGGTAACCAAATTTTTGGGCACAACACACAGCTGGCCTTCATTTCTTCAGGGGCTGGTAAACAGA
+
@@@ADFFFHHHFD=EF@:GHIIFHH<ECHGF@DDBB:6@D60?F=888)8='--(=5@EAE5?'(..((.;?@>>A>3;@####################
@HS2000-1015_160:6:1306:14563:31825/1
TTTTAAACTTATGAACATCCAACAGCAACTAATGGTAAGTCAGAAAGCTTTTATTTTTAAGTATATGCATGTGTATGTGTGTGTTTTGTATGTGTGTGTC
+
CCCFFFFFHHGHHJEDHJIJJJJIIJJJJJJGGIICFHIDGHH@HGIIIGGHFHHG>GGDHEBFIJJJHGI@FCDEEEHHGHHEHHFFFEFFFEEEEEDD
@HS2000-1015_160:7:2308:13476:86835/2
ATGGTCCTAAGTAACGTGAGGCCCAGAGGGAGGAGACCTTCAGAGGAGGCCAGGATGGGGGTGATGGGGAACTTTTCCTGAAGCTGGAGTAGGGAAAGAT
+
@@CFDFFFHFHDHHGIGJGIJIJJEG@FHGCFGBF=DFGFHGG>GGGGHIHCHFCEHHFFD'8<ACDDDD38CCDDDDCDDACDCDC?B:>@C?>A?A>>
@HS2000-1015_160:5:2306:10070:71746/1
AGAAGACGAGCACAGTGAGGAAGGAGATGAGCAGGCAGGGGATGATGAGGTTGATGGTGTAGAACAAGGGCAGGCGCCGGATGTACAGCGAGTATGTGAT
+
CCCFFFFFHHHHHIICGGHHHAFGEFGIJJJIIJIIGIIGG;FGHEHHGI=CHCEHGHE?DFFDFECEDDDDDDDDDBDDBD@ACDAACDBD5<CDC>@>
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:6:2116:4077:79041/1
TTGGGAGGGACATATCTAGGGTGGACCAAGCGCTGGCCAGGACTCACCTGGGCTGAAGAGGTGAGCCCAGAGGAGCTGGTGGTCTTGAAGCAGCTCCGCC
+
CCCFFFFFHHHHHIJIIIJFGCHIIIJJJIJJIJIJJJJJJJJJJJJJJJGHHHFFDDFC=ACEDDDB=?BD@ABDBDDCCBDCCCCCCCCCDDCDDDD?
@HS2000-1015_160:6:2116:4077:79041/2
GGTCCCCGCCTACGCCCACTGGGTTGGTGCACCTGGTGGTGGTGGCCGCCAAGAAGCTGGTGAACCGCCTCCAAGTGGCTCCCAAGACGCAGCTGGATGA
+
CCCFFFFFHHFHHJHJJJJJJJJGGHJHIGAGIIJIFHJ;@F;CHHFHFDDDDDCDDCDD9CCCDDBDDBBDDCDACDD8@BD3>?BCDBDDDACCDC@>
@HS2000-1015_160:5:1110:16577:55703/2
AGAAGCTGAAGTACGGTGTAGGAATGTTTCTACCACGCATTTTCCCTGCACCCTCTCACATCATTCATGGGCCATTTGTAGACAGATGGGAAAGAGATGT
+
CC@FFFFFHGDDHGIG:C<FBHEFHIGGGIIJIIIJIFJJJJIIHIJIHHIIDHIIIJJJJJJJIHHGEHHGEFFFCEDECDDEE>@CDDCCDBCDCCDC
@HS2000-1015_160:5:2113:11446:94436/1
AATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGCACCACTGCAGTCCAGCCTGGGAAACCAGAGCAAAACTTGGTCTCAAAAAAAAAA
+
CCCFFFFFHHHHGHGFGFHJIBF1DC<@;)9D*099?F;8=(BFF5(;'.=CEHFBDC7.>;?C@BBB35>(<?<<A>AC?B>4+(+>A:(:@(2)5@B#
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
@HS2000-1015_160:5:1216:14609:76938/2
TCAAAAGGCCTCTGGAGAGATTTCTGGGTGTGGGAGGTGCTATTTTGGGGCTTGGCCAGTCATATGGAGATAAGCCTACAAGGTTGGGGACCTGGCAGAT
+
CCCFFFFFFHHGHJJHHCCAFHIGHHJJ:E?FHHHDG?FGGIIJJIJIIJIJJIJHGCEECC@CBEFECEC@C@@CCDDDDDD>CCBDDD@??CBDC72:
@HS2000-1015_160:6:2209:18284:44195/1
GCAGCCAGCGGACGTCCAGGAACCGGGATGCCTCCAGCAGTGAGGCGGTCAGCCTGCAGCATGGGATGGCTGTGGATCTTTGGGGCAGCCCTGGGGCAGT
+
CCCFFFFFHHHHHIJHIIJJJJJJIJJEHIJJGIJJJJGHCHEHGFDDDDDCDDDDDDCDDCDDDCCBDCDDBCDDDDDDDDDBBDBDDDDDBDDDDBDC
@HS2000-1015_160:6:2209:18284:44195/2
TAAAATGTCACAAAGCTGGAAACTCTTCCCTATCACAAACCAAAACTTAAAAGGACGTTACCTGGCTGGGTCTAAACTCCACATAACTCGCTTGCAGTTG
+
CCCFFFFEHHHGHJIIIJJIJJHIIJEHJJHIJJJIIJJIJIJJIJIIHJJIJGGHGHGIIHHIIIIHFH@DFFFDEEEECDDDCDDDDBDDBBDCDACC

Result I want:

@HS2000-1015_160:7:1108:13370:100570/1
GTGGGTCTTTGATCTTACAGGACAAATCCTTCAGAAAGATCAGGAAATCAAGGACCTAAAACAGAAGATAGCCGAAGTCATGGCCGTCATGCCCAGCATA
+
CCCFFFFFHHHFHGGIJIJDHEIIIGEHGGII>CFII<BG<DGCADDHBD>>BC?(8CGCCHIEC;@9@D47?BAD4,.(.6.;(3=8<A>@:<@AB9AA
@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCCAGAGACGCTCCTTTGCAATGCCTTCCGGTAACCAAATTTTTGGGCACAACACACAGCTGGCCTTCATTTCTTCAGGGGCTGGTAAACAGA
+
@@@ADFFFHHHFD=EF@:GHIIFHH<ECHGF@DDBB:6@D60?F=888)8='--(=5@EAE5?'(..((.;?@>>A>3;@####################
+
@HS2000-1015_160:5:2306:10070:71746/1
AGAAGACGAGCACAGTGAGGAAGGAGATGAGCAGGCAGGGGATGATGAGGTTGATGGTGTAGAACAAGGGCAGGCGCCGGATGTACAGCGAGTATGTGAT
+
CCCFFFFFHHHHHIICGGHHHAFGEFGIJJJIIJIIGIIGG;FGHEHHGI=CHCEHGHE?DFFDFECEDDDDDDDDDBDDBD@ACDAACDBD5<CDC>@>
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:6:2116:4077:79041/1
TTGGGAGGGACATATCTAGGGTGGACCAAGCGCTGGCCAGGACTCACCTGGGCTGAAGAGGTGAGCCCAGAGGAGCTGGTGGTCTTGAAGCAGCTCCGCC
+
CCCFFFFFHHHHHIJIIIJFGCHIIIJJJIJJIJIJJJJJJJJJJJJJJJGHHHFFDDFC=ACEDDDB=?BD@ABDBDDCCBDCCCCCCCCCDDCDDDD?
@HS2000-1015_160:6:2116:4077:79041/2
GGTCCCCGCCTACGCCCACTGGGTTGGTGCACCTGGTGGTGGTGGCCGCCAAGAAGCTGGTGAACCGCCTCCAAGTGGCTCCCAAGACGCAGCTGGATGA
+
CCCFFFFFHHFHHJHJJJJJJJJGGHJHIGAGIIJIFHJ;@F;CHHFHFDDDDDCDDCDD9CCCDDBDDBBDDCDACDD8@BD3>?BCDBDDDACCDC@>
+
@HS2000-1015_160:5:2113:11446:94436/1
AATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGCACCACTGCAGTCCAGCCTGGGAAACCAGAGCAAAACTTGGTCTCAAAAAAAAAA
+
CCCFFFFFHHHHGHGFGFHJIBF1DC<@;)9D*099?F;8=(BFF5(;'.=CEHFBDC7.>;?C@BBB35>(<?<<A>AC?B>4+(+>A:(:@(2)5@B#
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
+
@HS2000-1015_160:6:2209:18284:44195/1
GCAGCCAGCGGACGTCCAGGAACCGGGATGCCTCCAGCAGTGAGGCGGTCAGCCTGCAGCATGGGATGGCTGTGGATCTTTGGGGCAGCCCTGGGGCAGT
+
CCCFFFFFHHHHHIJHIIJJJJJJIJJEHIJJGIJJJJGHCHEHGFDDDDDCDDDDDDCDDCDDDCCBDCDDBCDDDDDDDDDBBDBDDDDDBDDDDBDC
@HS2000-1015_160:6:2209:18284:44195/2
TAAAATGTCACAAAGCTGGAAACTCTTCCCTATCACAAACCAAAACTTAAAAGGACGTTACCTGGCTGGGTCTAAACTCCACATAACTCGCTTGCAGTTG
+
CCCFFFFEHHHGHJIIIJJIJJHIIJEHJJHIJJJIIJJIJIJJIJIIHJJIJGGHGHGIIHHIIIIHFH@DFFFDEEEECDDDCDDDDBDDBBDCDACC
MAPK
  • 5,635
  • 4
  • 37
  • 88
  • any idea on how far apart the matching `\1` and `\2` lines could be (eg, no more than 20 lines apart from each other)? do all header lines start with the literal `@HS` (or could they start with other letter combinations, too)? also, does the final output need to be sorted in any particular order or does that not matter as long as matching `\1` and `\2` lines are next to each other? – markp-fuso Nov 05 '20 at 18:49
  • Yes all headers start with @HS2000. The order of 4-lines groups doesn't matter as long as these four lines are together. – MAPK Nov 05 '20 at 18:53
  • a google search on `sort by multiple lines` generates a few hits including [Sort text files with multiple lines as a row](https://unix.stackexchange.com/q/333961) which has a few answers that may be of help – markp-fuso Nov 05 '20 at 19:33

2 Answers2

1

Assumptions/assertions:

  • all groups have 4 lines (ie, there are no 3-line or 5-line groups)
  • the size of the file (80 GB) precludes loading the whole thing into memory (eg, awk array) for grouping/sorting purposes
  • the ~ character does not exist in the file
  • each unique header matches at most 2 lines in the file and said matches have the suffixes /1 and /2

One awk/sort/awk idea:

  • append each group of 4 lines together using the ~ character (effectively replace \n with ~)
  • use generic sort to sort the appended lines
  • use awk to strip out only those lines that have matching /1 and /2 entries
  • replace the ~ character with newline character (\n)

NOTE: fastq.dat contains a copy of the 'before' data provided in the question

awk '{ (FNR % 4 == 0) ? sfx="\n" : sfx="~"; printf "%s%s", $0 , sfx }' fastq.dat | 
sort | 
awk -F '/' 'prevhdr != $1 { prevhdr=$1 ; prevline=$0 ; next } { gsub(/~/, "\n", prevline) ; print prevline ; gsub(/~/, "\n", $0) ; print }' > fastq.final

NOTES:

  • will obviously need enough space for another 80 GB file (perhaps on different filesystem so as to minimize FS thrashing by 2 processes reading/writing 80 GB on the same underlying device?)
  • assuming at this point the OS can handle the inflight memory requirements otherwise may need to perform each step separately while writing to individual output files (???)

OP has commented about memory usage and slow generation of output. Might be of interest to see if breaking the process into separate commands may help, albeit at the cost of needing more disk space, eg:

awk '{ (FNR % 4 == 0) ? sfx="\n" : sfx="~"; printf "%s%s", $0 , sfx }' fastq.dat > fastq.appended

sort fastq.appended > fastq.sorted 

awk -F '/' 'prevhdr != $1 { prevhdr=$1 ; prevline=$0 ; next } { gsub(/~/, "\n", prevline) ; print prevline ; gsub(/~/, "\n", $0) ; print }' fastq.sorted > fastq.final

'rm' -rf fastq.appended fastq.sorted 2>/dev/null

Assuming the sort operation may be the big pig (memory and slowness), this answer about speeding up sort may be of interest. The general idea is to increase the memory used by sort and allow for parallel threads, eg:

sort -S 50% --parallel=4

Showing that 4-line groups have been maintained:

$ head -8 fastq.final
@HS2000-1015_160:5:2113:11446:94436/1
AATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGCACCACTGCAGTCCAGCCTGGGAAACCAGAGCAAAACTTGGTCTCAAAAAAAAAA
+
CCCFFFFFHHHHGHGFGFHJIBF1DC<@;)9D*099?F;8=(BFF5(;'.=CEHFBDC7.>;?C@BBB35>(<?<<A>AC?B>4+(+>A:(:@(2)5@B#
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################

Pulling just the @HS rows to show data has been sorted as desired:

$ grep '@HS' fastq.final
@HS2000-1015_160:5:2113:11446:94436/1
@HS2000-1015_160:5:2113:11446:94436/2

@HS2000-1015_160:5:2306:10070:71746/1
@HS2000-1015_160:5:2306:10070:71746/2

@HS2000-1015_160:6:2116:4077:79041/1
@HS2000-1015_160:6:2116:4077:79041/2

@HS2000-1015_160:6:2209:18284:44195/1
@HS2000-1015_160:6:2209:18284:44195/2

@HS2000-1015_160:7:1108:13370:100570/1
@HS2000-1015_160:7:1108:13370:100570/2

NOTE: blank lines manually added solely for display purposes

markp-fuso
  • 28,790
  • 4
  • 16
  • 36
  • WHay are you having first three lines in your results? `@HS2000-1015_160:5:1110:16577:55703/2 @HS2000-1015_160:5:1114:14809:86636/2 @HS2000-1015_160:5:1216:14609:76938/2` – MAPK Nov 05 '20 at 23:24
  • you only want those entries that have 2 matches (eg, the first 3 lines in my output should **not** exist)? – markp-fuso Nov 05 '20 at 23:50
  • @ Yes. I only want interleaved /1 and /2 pairs. – MAPK Nov 05 '20 at 23:54
  • 1
    added an extra `awk` step to discard singles – markp-fuso Nov 06 '20 at 00:05
  • @MAPK - assuming this works for you ... I'd be curious as to the performance metrics (eg, time to run, memory usage) – markp-fuso Nov 06 '20 at 00:30
  • I am running on a 8GB file and it just got stuck there. No output yet. I am using 35 gb RAM to process this, but still nothing has come out. – MAPK Nov 06 '20 at 00:32
  • was afraid of that ... too much data (in memory) with all the pipes; if you can't get a `msort` solution working (see link in my comment - under the question) it may be necessary to break into chunks, eg, `cat/paste` (maybe replace with a single `awk` to do the same thing), then `sort`, then `awk` (with a bit of coding could pull the `tr` code into the `awk` script by having `awk` break the line by `~` delimiter and print the the 4x parts) – markp-fuso Nov 06 '20 at 00:39
  • @MAPK ; eliminated the `cat/paste` in favor of a single `awk` call; eliminated the `tr` by rolling the conversion into the final `awk` call; also added a section whereby each step is performed as a standalone operation (to see if eliminating the pipe overhead might help with memory usage and/or slowness of processing); also added some notes about bumping up memory/threads for `sort` – markp-fuso Nov 06 '20 at 14:36
1

A bit long for a comment ... wondering if either of the following may be better suited for processing these largish data sets.

NOTE: I don't work with fastq data and the following is based on ~15 minutes of googling, so fwiw ...


1 - Generate the fastq files in a fashion that allows for faster processing

The documentation for fastq-dump mentions that use of the --split-3 flag will dump the fastq data into 3x files ... a 'left' file, a 'right' file and a 'singleton' file.

Would the 'left' and 'right' files be referring to your /1 and /2 suffixes, respectively? If 'yes' then I'm wondering if you could generate the 3x files (and then discard the 'singleton' file)?

Would also be of interest to see if the data in the 'left' and 'right' files have their headers in the same order (in which case some sort of 'merge' join between the 2x files would likely be faster than manually pre-sorting and then matching).

Assuming the 3x files could be generated, this leads to the second idea ...


2 - Are there any tools designed for matching the left/right (/1 & /2) datasets?

The brief introduction to fastq-pair seems to indicate that multiple files (like the 'left' and 'right' files generated by fastq-dump - above ??) can be matched more quickly than many brute force solutions (eg, a referenced awk solution which is a bit similar to my first answer).

markp-fuso
  • 28,790
  • 4
  • 16
  • 36