
I have a function whose memory footprint I am trying to reduce. The maximum amount of memory I can use is only 500MB. It seems that using .split('\t') and for loops is using a lot of memory. Is there a way I can reduce this memory usage?

Line #    Mem usage  Increment   Line Contents
==============================================
10     35.4 MiB      0.0 MiB   @profile
11                             def function(username):
12     35.4 MiB      0.0 MiB       key = s3_bucket.get_key(username)
13     85.7 MiB     50.2 MiB       file_data = key.get_contents_as_string()
14    159.3 MiB     73.6 MiB       g = [x for x in file_data.splitlines() if not x.startswith('#')]
15    144.8 MiB    -14.5 MiB       del file_data
16    451.8 MiB    307.1 MiB       data = [x.split('\t') for x in g]
17    384.0 MiB    -67.8 MiB       del g
18
19    384.0 MiB      0.0 MiB       d = []
20    661.7 MiB    277.7 MiB       for row in data:
21    661.7 MiB      0.0 MiB           d.append({'key': row[0], 'value':row[3]})
22    583.7 MiB    -78.0 MiB       del data
25    700.8 MiB    117.1 MiB       database[username].insert_many(d)
26    700.8 MiB      0.0 MiB       return
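
For readability, here is the same function without the profiler columns (s3_bucket and database are the same objects as in the profile above):

def function(username):
    key = s3_bucket.get_key(username)
    file_data = key.get_contents_as_string()
    # keep only non-comment lines, then split each remaining line on tabs
    g = [x for x in file_data.splitlines() if not x.startswith('#')]
    del file_data
    data = [x.split('\t') for x in g]
    del g

    # build one dict per row and bulk-insert them all at once
    d = []
    for row in data:
        d.append({'key': row[0], 'value': row[3]})
    del data
    database[username].insert_many(d)
    return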

UPDATE1

As per the suggestions of @Jean-FrançoisFabre and @Torxed, it's an improvement, but the generators still seem to take a large amount of memory.

@martineau I'd prefer to use MongoDB .insert_many() as iterating over the keys and performing .insert() is much slower.

20     35.3 MiB      0.0 MiB   @profile
21                             def function(username):
22     85.4 MiB     50.1 MiB       file_data = s3_bucket.get_key(username).get_contents_as_string()
23    610.5 MiB    525.2 MiB       data = (x.split('\t') for x in isplitlines(file_data) if not x.startswith('#'))
24    610.5 MiB      0.0 MiB       d = ({'key': row[0], 'value':row[3]} for row in data)
25    123.3 MiB   -487.2 MiB       database[username].insert_many(d)
26    123.3 MiB      0.0 MiB       return

UPDATE2

I've identified the source of the memory usage as this profile shows:

21     41.6 MiB      0.0 MiB   @profile
22                             def insert_genotypes_into_mongodb(username):
23     91.1 MiB     49.4 MiB       file_data = s3_bucket.get_key(username).get_contents_as_string()
24     91.1 MiB      0.0 MiB       genotypes = (x for x in isplitlines(file_data) if not x.startswith('#'))
25     91.1 MiB      0.0 MiB       d = ({'rsID': row.split('\t')[0], 'genotype':row.split('\t')[3]} for row in genotypes)
26                                 # snps_database[username].insert_many(d)
27     91.1 MiB      0.0 MiB       return

The insert_many() call clearly forces the generators from the previous lines to be evaluated, which loads the whole list into memory and confuses the profiler.

The solution is to insert the keys into MongoDB in chunks:

22     41.5 MiB      0.0 MiB   @profile
23                             def insert_genotypes_into_mongodb(username):
24     91.7 MiB     50.2 MiB       file_data = s3_bucket.get_key(username).get_contents_as_string()
25    180.2 MiB     88.6 MiB       genotypes = (x for x in isplitlines(file_data) if not x.startswith('#'))
26    180.2 MiB      0.0 MiB       d = ({'rsID': row.split('\t')[0], 'genotype':row.split('\t')[3]} for row in genotypes)
27     91.7 MiB    -88.6 MiB       chunk_step = 100000
28
29     91.7 MiB      0.0 MiB       has_keys = True
30    127.4 MiB     35.7 MiB       keys = list(itertools.islice(d,chunk_step))
31    152.5 MiB     25.1 MiB       while has_keys:
32    153.3 MiB      0.9 MiB           snps_database[username].insert_many(keys)
33    152.5 MiB     -0.9 MiB           keys = list(itertools.islice(d,chunk_step))
34    152.5 MiB      0.0 MiB           if len(keys) == 0:
35    104.9 MiB    -47.6 MiB               has_keys = False
36                                 # snps_database[username].insert_many(d[i*chunk_step:(i+1)*chunk_step])
37    104.9 MiB      0.0 MiB       return
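
For reference, here is the same chunked insert without the profiler columns, lightly restructured with a while True loop (isplitlines, s3_bucket and snps_database are as defined elsewhere in this post):

import itertools

def insert_genotypes_into_mongodb(username):
    file_data = s3_bucket.get_key(username).get_contents_as_string()
    genotypes = (x for x in isplitlines(file_data) if not x.startswith('#'))
    d = ({'rsID': row.split('\t')[0], 'genotype': row.split('\t')[3]} for row in genotypes)

    chunk_step = 100000
    while True:
        # islice pulls at most chunk_step documents from the generator,
        # so only one chunk is ever materialised in memory at a time
        keys = list(itertools.islice(d, chunk_step))
        if not keys:
            break
        snps_database[username].insert_many(keys)
    return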

Thanks for all the help.

WillJones
  • Iterate over generators and generator expressions (or other lazily evaluated constructs) in lieu of lists. – juanpa.arrivillaga Jan 04 '17 at 17:13
  • I'm not sure about this, but have you tried forcing garbage collection with `import gc` and then a `gc.collect()` call after `del g` and `del data`? – Gurupad Mamadapur Jan 04 '17 at 17:13
  • 1
    `data = [x.split('\t') for x in g]` That's because you're not using the list as a iterator, you're using is basically as `x = list(something)` which has to wait for ALL data to be collected before creating the `data` variable. use a `for obj in x.split()` instead. – Torxed Jan 04 '17 at 17:13
  • Could you incrementally insert the contents of `d` instead of putting all those dictionaries in the list with one call to `database[username].insert_many(d)`? If so, then you can use some of the other suggestions for incrementally processing the rest of data instead of reading and maintaining it all in memory at once. Another approach would be to write it out to a temporary file and read it back in a row/line at a time. – martineau Jan 04 '17 at 17:18

2 Answers


First, don't use splitlines(), as it creates a list; you need an iterator. So you could use the example from Iterate over the lines of a string to get an iterator version of splitlines():

def isplitlines(foo):
    retval = ''
    for char in foo:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

Personal note: this isn't very efficient because of the string concatenation. I've rewritten it using a list and str.join. My version:

def isplitlines(buffer):
    retval = []
    for char in buffer:
        if not char == '\n':
            retval.append(char)
        else:
            yield "".join(retval)
            retval = []
    if retval:
        yield "".join(retval)

Then, avoid using del, since no intermediate lists (except the one produced by splitting each row) are used. Just "compress" your code, skipping the g part and creating d as a generator expression instead of a list comprehension:

def function(username):
    key = s3_bucket.get_key(username)
    file_data = key.get_contents_as_string()
    data = (x.split('\t') for x in isplitlines(file_data) if not x.startswith('#'))
    d = ({'key': row[0], 'value': row[3]} for row in data)
    database[username].insert_many(d)

This could be "one-lined" a little more, but it would be difficult to understand, and the current code is OK. Think of it as chained generator expressions working together, with only one big source chunk of memory: file_data.
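
As a small illustration (hypothetical sample data, not from the question), nothing is actually computed until the chain is consumed:

# two data rows and one comment line, tab-separated as in the question
sample = "#comment\na\t1\tx\tAA\nb\t2\ty\tBB"
data = (x.split('\t') for x in isplitlines(sample) if not x.startswith('#'))
d = ({'key': row[0], 'value': row[3]} for row in data)
print(next(d))   # {'key': 'a', 'value': 'AA'} -- rows are built one at a time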

Jean-François Fabre
  • Thanks! See my updated post there's an improvement but still has a very large footprint. – WillJones Jan 05 '17 at 10:10
  • BTW can someone explain that line `24 610.5 MiB 0.0 MiB d = ({'key': row[0], 'value':row[3]} for row in data)`? What are the values, before/after? And why does the generator declaration (not running) allocate that much memory? That may be the key to all this. – Jean-François Fabre Jan 05 '17 at 10:26
  • Yes - these are the return values from memory profiler: https://pypi.python.org/pypi/memory_profiler – WillJones Jan 05 '17 at 11:00

The solution is to insert the keys into MongoDB in chunks:

22     41.5 MiB      0.0 MiB   @profile
23                             def insert_genotypes_into_mongodb(username):
24     91.7 MiB     50.2 MiB       file_data = s3_bucket.get_key(username).get_contents_as_string()
25    180.2 MiB     88.6 MiB       genotypes = (x for x in isplitlines(file_data) if not x.startswith('#'))
26    180.2 MiB      0.0 MiB       d = ({'rsID': row.split('\t')[0], 'genotype':row.split('\t')[3]} for row in genotypes)
27     91.7 MiB    -88.6 MiB       chunk_step = 100000
28
29     91.7 MiB      0.0 MiB       has_keys = True
30    127.4 MiB     35.7 MiB       keys = list(itertools.islice(d,chunk_step))
31    152.5 MiB     25.1 MiB       while has_keys:
32    153.3 MiB      0.9 MiB           snps_database[username].insert_many(keys)
33    152.5 MiB     -0.9 MiB           keys = list(itertools.islice(d,chunk_step))
34    152.5 MiB      0.0 MiB           if len(keys) == 0:
35    104.9 MiB    -47.6 MiB               has_keys = False
36                                 # snps_database[username].insert_many(d[i*chunk_step:(i+1)*chunk_step])
37    104.9 MiB      0.0 MiB       return
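
A more general way to express the same idea is a small batching helper that works for any iterable (a sketch of my own, not part of the code above; the name chunked is arbitrary):

import itertools

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch

# usage with the generator `d` from the function above:
#     for batch in chunked(d, 100000):
#         snps_database[username].insert_many(batch)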
WillJones