
Problem: find ids that are in one file but not in another. Each file is about 6.5 GB. Specifically (for those in the bioinformatics domain), one file is a fastq file of sequencing reads and the other is a sam alignment file from a tophat run. I would like to determine which reads in the fastq file are not in the sam alignment file.

I am getting java.lang.OutOfMemoryError: Java heap space errors. As suggested (ref1, ref2), I am using lazy sequences. However, I am still running out of memory. I have looked at this tutorial, but I don't quite understand it yet. So I am posting my less sophisticated attempt at a solution, in the hope that I am only making a minor mistake.

My attempt:

Since neither file will fit into memory, the lines of the sam file are read a chunk at a time, and the id from each line in the chunk is put into a set. A lazy seq of fastq ids is then filtered against that set, keeping only the ids that are not in it. This is repeated with the next chunk of sam lines and the remaining fastq ids.

(defn ids-not-in-sam 
  [ids samlines chunk-size]
  (lazy-seq
    (if (seq samlines)
      (ids-not-in-sam (not-in (into #{} (qnames (take chunk-size samlines))) ids)
                      (drop chunk-size samlines) chunk-size)
      ids)))

not-in determines which ids are not in the set.

(defn not-in
  "Return the elements of xs which are not in the set s."
  [s xs]
  (filter (complement s) xs))

qnames gets the id field from a line in the sam file.

(defn qnames [samlines]
  (map #(first (.split #"\t" %)) samlines))

Finally, it's put together with io (using read-lines and write-lines from clojure.contrib.io).

(defn write-fq-not-in-sam [fqfile samfile fout chunk-size] 
    (io/write-lines fout (ids-not-in-sam (map fq-id (read-fastq fqfile))
                                         (read-sam samfile) chunk-size))) 

I am pretty sure I am doing everything in a lazy manner, but I may be holding onto the head of a sequence somewhere without noticing it.

Is there an error in the code above that is causing the heap to fill up? More importantly, is my approach to the problem all wrong? Is this an appropriate use of lazy sequences, or am I expecting too much?

(The error could also be in the read-sam and read-fastq functions, but my post is already a bit long. I can show those later if need be.)

  • Is there some correlation between the locations of ids in the two files? For example, are they grouped so that you are testing the first group in one file against the first group in the other? Because if not, I don't understand how this will save memory - the best you can do is to have (almost) all of one set in memory while you (lazily) run through the other. But I don't know anything about the field... (Or maybe that works because one file is much smaller than the other?) – andrew cooke Aug 16 '11 at 18:07
  • No, there is no correlation. I see your point and where my mistake is: my code compares all of the first set to a subset of the second set, then repeats with what is left over from the first set against the next subset of the second set, and so on. Thus all of the first set is read, and the process avoids running out of memory only if most of the ids are eliminated after the first pass. That must not be the case, hence the error. I'll try putting most of the second set in memory and see if that works. Thanks. – Neil Aug 16 '11 at 19:11
  • OK, makes sense. This sounds like the kind of problem where a Bloom filter might help (not quite sure which way round things are, but even if you can only use the probabilistic result it still might help reduce the amount of data). – andrew cooke Aug 16 '11 at 19:17
  • FYI, reading most (or even all) of one set into memory still gives an out of memory error, this time GC overhead limit exceeded (even with -Xmx set to 16 GB!). The Bloom filter (sketched below) looks like an elegant solution, but beyond my skill level to implement. So sorting it is for now. – Neil Aug 16 '11 at 22:55
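
For reference, here is a rough sketch of the Bloom-filter idea from the comments; it is my own illustration rather than anything from the thread. It uses a plain java.util.BitSet of m bits and k pseudo-hash functions derived from Clojure's hash on [seed id] pairs; the names bloom-add!, bloom-maybe-contains? and fq-ids-definitely-not-in-sam are assumptions, and m should be sized at several bits per sam id to keep the false-positive rate low.

(import 'java.util.BitSet)

(defn bloom-indices
  "Derive k bit positions in [0, m) for an id by hashing [seed id] pairs."
  [m k id]
  (for [seed (range k)]
    (mod (Math/abs (long (hash [seed id]))) m)))

(defn bloom-add! [^BitSet bits m k id]
  (doseq [i (bloom-indices m k id)]
    (.set bits (int i))))

(defn bloom-maybe-contains? [^BitSet bits m k id]
  (every? #(.get bits (int %)) (bloom-indices m k id)))

;; Stream every sam id into the filter once, then keep only the fastq ids
;; the filter says are definitely absent. Ids reported as "maybe present"
;; would need an exact second pass to rule out false positives.
(defn fq-ids-definitely-not-in-sam [fq-ids sam-ids m k]
  (let [bits (BitSet. (int m))]
    (doseq [id sam-ids]
      (bloom-add! bits m k id))
    (remove #(bloom-maybe-contains? bits m k %) fq-ids)))

The filter itself occupies only m bits no matter how large the files are, which is why it can help here even though neither id set fits in memory as a hash set.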

1 Answer

  1. Sort both data sets (read chunks into memory, sort them, write each sorted chunk to a temp file, then merge the sorted files).
  2. Iterate over both sorted sets in step to find the intersecting/missing elements (I hope the algorithm is clear; a sketch follows below).
aav
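
For step 2, here is a minimal sketch (my own addition, not part of the answer above) of the merge-style walk over the two id streams once both files have been externally sorted as in step 1; ids-not-in-sorted is an assumed name, and both arguments are expected to be lazy seqs of id strings in the same sort order.

(defn ids-not-in-sorted
  "Lazily return the ids from fq-ids that do not appear in sam-ids.
   Both seqs must be sorted in the same (here, natural string) order."
  [fq-ids sam-ids]
  (lazy-seq
    (cond
      (empty? fq-ids)  nil
      (empty? sam-ids) fq-ids
      :else
      (let [f (first fq-ids)
            s (first sam-ids)
            c (compare f s)]
        (cond
          ;; f is smaller than every remaining sam id, so it is missing.
          (neg? c)  (cons f (ids-not-in-sorted (rest fq-ids) sam-ids))
          ;; f was found in the sam file; drop it and move on.
          (zero? c) (ids-not-in-sorted (rest fq-ids) sam-ids)
          ;; s can no longer match anything; advance the sam side.
          :else     (ids-not-in-sorted fq-ids (rest sam-ids)))))))

Because both files are consumed front to back and only the current heads are retained, memory use stays constant regardless of file size.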