I am working with large files holding several million records each (approx. 2 GB unpacked, several hundred MB gzipped). I iterate over the records with islice, which lets me grab either a small portion (for debugging and development) or the whole thing when I want to test the code. I have noticed absurdly large memory usage and am therefore trying to track down the memory leak in my code.
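For context, the reading setup is roughly the following (a minimal sketch; the file names and the fastq format are placeholders on my part, but the islice/izip pattern is the one that shows up in the profile below):

```python
import gzip
from itertools import islice, izip
from Bio import SeqIO

# Lazily parse both gzipped read files (placeholder names and format)
read1 = SeqIO.parse(gzip.open("reads_R1.fastq.gz"), "fastq")
read2 = SeqIO.parse(gzip.open("reads_R2.fastq.gz"), "fastq")

# Take only the first nbrofitems record pairs: a small slice for
# debugging, or a huge number to effectively run over the whole file.
nbrofitems = 10**5
for rec1, rec2 in islice(izip(read1, read2), nbrofitems):
    pass  # filtering and counting happen here, see paired_read() below
```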
Below is the output from memory_profiler on a paired read (where I open two files and zip the records), for ONLY 10**5 values (the default value gets overwritten).
Line # Mem usage Increment Line Contents
================================================
137 27.488 MiB 0.000 MiB @profile
138 def paired_read(read1, read2, nbrofitems = 10**8):
139 """ Procedure for reading both sequences and stitching them together """
140 27.488 MiB 0.000 MiB seqFreqs = Counter()
141 27.488 MiB 0.000 MiB linker_str = "~"
142 #for rec1, rec2 in izip(read1, read2):
143 3013.402 MiB 2985.914 MiB for rec1, rec2 in islice(izip(read1, read2), nbrofitems):
144 3013.398 MiB -0.004 MiB rec1 = rec1[9:] # Trim the primer variable sequence
145 3013.398 MiB 0.000 MiB rec2 = rec2[:150].reverse_complement() # Trim the low quality half of the 3' read AND take rev complement
146 #aaSeq = Seq.translate(rec1 + rec2)
147
148 global nseqs
149 3013.398 MiB 0.000 MiB nseqs += 1
150
151 3013.402 MiB 0.004 MiB if filter_seq(rec1, direction=5) and filter_seq(rec2, direction=3):
152 3013.395 MiB -0.008 MiB aakey = str(Seq.translate(rec1)) + linker_str + str(Seq.translate(rec2))
153 3013.395 MiB 0.000 MiB seqFreqs.update({ aakey : 1 })
154
155 3013.402 MiB 0.008 MiB print "========================================"
156 3013.402 MiB 0.000 MiB print "# of total sequences: %d" % nseqs
157 3013.402 MiB 0.000 MiB print "# of filtered sequences: %d" % sum(seqFreqs.values())
158 3013.461 MiB 0.059 MiB print "# of repeated occurances: %d" % (sum(seqFreqs.values()) - len(list(seqFreqs)))
159 3013.461 MiB 0.000 MiB print "# of low-score sequences (<20): %d" % lowQSeq
160 3013.461 MiB 0.000 MiB print "# of sequences with stop codon: %d" % starSeqs
161 3013.461 MiB 0.000 MiB print "========================================"
162 3013.504 MiB 0.043 MiB pprint(seqFreqs.most_common(100), width = 240)
The code, in short, does some filtering on the records and keeps track of how many times each string occurs in the file (a zipped pair of strings in this particular case).
100 000 strings of 150 chars with integer values in a Counter should land around 100 MB tops, which I checked using the following function by @AaronHall.
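The size check is essentially the recursive sys.getsizeof walk from @AaronHall's answer; a sketch of it (possibly differing in details from the exact version I used):

```python
import sys
from gc import get_referents
from types import ModuleType, FunctionType

# Types whose referents we do not want to walk into (classes, modules, functions)
BLACKLIST = type, ModuleType, FunctionType

def getsize(obj):
    """Recursively sum sys.getsizeof() over an object and everything it references."""
    if isinstance(obj, BLACKLIST):
        raise TypeError('getsize() does not take argument of type: ' + str(type(obj)))
    seen_ids = set()
    size = 0
    objects = [obj]
    while objects:
        need_referents = []
        for o in objects:
            if not isinstance(o, BLACKLIST) and id(o) not in seen_ids:
                seen_ids.add(id(o))
                size += sys.getsizeof(o)
                need_referents.append(o)
        objects = get_referents(*need_referents)
    return size
```

Calling getsize(seqFreqs) after the loop reports the total footprint of the Counter, keys included, which is where the ~100 MB upper estimate comes from.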
Given the memory_profiler output, I suspect that islice doesn't let go of the previous records over the course of the iteration. A Google search landed me at this bug report; however, it is marked as resolved for Python 2.7, which is what I am running at the moment.
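One way to isolate whether islice/izip themselves retain anything would be a stripped-down run with throwaway strings in place of the SeqIO records (a sketch of what I mean, not the actual script):

```python
from itertools import islice, izip

def fake_reads(n, length=300):
    """Yield record-sized throwaway strings, one at a time."""
    for _ in xrange(n):
        yield "A" * length

# Same iteration pattern as paired_read(), but with plain strings:
# if memory stays flat here, islice/izip are not holding on to the items,
# and the growth must come from the records themselves.
for s1, s2 in islice(izip(fake_reads(10**6), fake_reads(10**6)), 10**5):
    pass
```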
Any opinions?
EDIT: I have tried skipping islice, as per the comment below, and using a for loop like

for rec in list(next(read1) for _ in xrange(10**5)):

instead, which makes no significant difference. This was done on a single file, in order to also avoid izip, which likewise comes from itertools.
A secondary troubleshooting idea I had was to check whether gzip.open() reads and expands the whole file into memory, and thus causes the issue here. However, running the script on decompressed files doesn't make a difference.
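For completeness, this is roughly how the streaming behaviour of gzip.open() could be verified under memory_profiler (a sketch; the path is a placeholder):

```python
import gzip
from memory_profiler import profile

@profile
def count_lines(path):
    """Iterate over a gzipped file line by line without keeping references;
    memory usage should stay flat if gzip.open() really streams."""
    handle = gzip.open(path)
    n = 0
    for _ in handle:
        n += 1
    handle.close()
    return n

count_lines("reads_R1.fastq.gz")  # placeholder path
```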