Problem: find ids that are in one file but not in another. Each file is about 6.5 GB. Specifically (for those in the bioinformatics domain), one file is a fastq file of sequencing reads and the other is a sam alignment file from a tophat run. I would like to determine which reads in the fastq file are not in the sam alignment file.
I am getting java.lang.OutOfMemory: Java heap space
errors. As suggested (ref1, ref2) I am using lazy sequences. However, I am still running out of memory. I have looked at this tutorial, but I don't quite understand it yet. So I am posting my less sophisticated attempt at a solution with the hope that I am only making a minor mistake.
My attempt:
Since neither file will fit into memory, the lines from the sam file are read a chunk at a time and the ids of each line in the chunk are put into a set. A lazy list of fastq ids are then filtered using the sam ids in the set keeping only those ids that are not in the set. This is repeated with the next chunk of sam lines and the remaining fastq ids.
(defn ids-not-in-sam
[ids samlines chunk-size]
(lazy-seq
(if (seq samlines)
(ids-not-in-sam (not-in (into #{} (qnames (take chunk-size samlines))) ids)
(drop chunk-size samlines) chunk-size)
ids)))
not-in
determines which ids are not in the set.
(defn not-in
; Return the elements x of xs which are not in the set s
[s xs]
(filter (complement s) xs))
qnames
gets the id field from a line in the sam file.
(defn qnames [samlines]
(map #(first (.split #"\t" %)) samlines))
Finally, its put together with io (using read-lines
and write-lines
from clojure.contrib.io
.
(defn write-fq-not-in-sam [fqfile samfile fout chunk-size]
(io/write-lines fout (ids-not-in-sam (map fq-id (read-fastq fqfile))
(read-sam samfile) chunk-size)))
I am pretty sure I am doing every thing in a lazy manner. But I may be holding onto the head of a sequence somewhere that I do not notice.
Is there an error in the code above that is causing the heap to fill up? More importantly, is my approach to the problem all wrong? Is this an appropriate use for lazy sequences, am I expecting too much?
(The errors could be in the read-sam
and read-fastq
functions, but my post is already a bit long. I can show those later if need be).