
I have multiple processes, each dealing with lists that have 40,000 tuples. This nearly maxes out the memory available on the machine. If I do this:

        while len(collection) > 0:
            row = collection.pop(0)
            row_count = row_count + 1
            new_row = []
            for value in row:
                if value is not None:
                    in_chars = str(value)
                else:
                    in_chars = ""

                #escape any naughty characters
                new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
            new_row = "\t".join(new_row)
            rows += "\n"+new_row
            if row_count % 5000 == 0:
                gc.collect()

does this free more memory?

tipu
  • I don't mean to be snide, but why don't you just try it and see? – John Y Apr 05 '13 at 02:53
  • Figured someone might know whether it's worth it overall, given the overhead of the GC, or whether there's some Python internal behaviour I would have overlooked. – tipu Apr 05 '13 at 02:59
  • If it's possible, it might be worth looking into using iterators and, instead of processing all 40k tuples at once, building the list and processing them at the same time. That will add some complexity and might not be worth the effort involved. – Moshe Apr 05 '13 at 03:06
  • I'm pulling 40k from a database. I have to put it all in memory at once, because the connection dies every couple of seconds and needs to be reset. There is no query caching on the DB, so any results not put in memory immediately are lost. – tipu Apr 05 '13 at 03:19
  • @tipu What type of variable is `collection`? `.pop(0)` is always scary unless you are using a `deque`. – jamylak Apr 05 '13 at 03:32
  • Do you have reliable filesystem access or is memory all that you have? – Sean Vieira Apr 05 '13 at 03:37
  • @SeanVieira I have a reliable filesystem. – tipu Apr 05 '13 at 03:48
  • @jamylak collection is a list of tuples. – tipu Apr 05 '13 at 03:49
  • @tipu Then that operation will be slow because it will require shifting every tuple in the list (O(N)). You should use [`collections.deque`](http://docs.python.org/2/library/collections.html#collections.deque) class which has the `.popleft()` method that takes O(1) time instead. – jamylak Apr 05 '13 at 03:51
  • @jamylak, MySQLdb returns a tuple of tuples. I am converting that to a list before this loop hits. Is it also dreadful to convert a tuple to a collections object? – tipu Apr 05 '13 at 03:52
  • @tipu A `deque` is very similar to a list; it supports the same operations. It won't be slower, if that's what you are asking. – jamylak Apr 05 '13 at 03:54
  • @jamylak Throw these suggestions into an answer and I'll accept. – tipu Apr 05 '13 at 04:00
  • @tipu I don't really have time to look into your question, I just skimmed over it, so I don't think it's good enough to be an answer. – jamylak Apr 05 '13 at 04:01
  • @jamylak You may not have hit the nail on the head, but you provided a ton of other support and I wanted to accept your answer. Thanks anyway! – tipu Apr 05 '13 at 04:11

2 Answers


Since the collection is shrinking at the same rate that rows is growing, your memory usage will remain stable. The gc.collect() call is not going to make much difference.

Memory management in CPython is subtle. Just because you remove references and run a collection cycle doesn't necessarily mean that the memory will be returned to the OS. See this answer for details.
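
To see this in practice, here is a minimal sketch (Linux-only, since it reads /proc/self/status; exact numbers vary by platform and Python version) showing that deleting a large list and calling gc.collect() does not necessarily shrink the process's resident set size:

import gc

def rss_kb():
    # Linux-specific: read the resident set size from /proc/self/status
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS'):
                return int(line.split()[1])

print 'baseline:', rss_kb(), 'kB'
data = [(i, i, 'tuple {}'.format(i)) for i in xrange(40000)]
print 'with 40k tuples:', rss_kb(), 'kB'
del data
gc.collect()
# RSS often does not drop all the way back to the baseline figure,
# because CPython's allocator may hold on to freed arenas.
print 'after del + gc.collect():', rss_kb(), 'kB'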

To really save memory, you should structure this code around generators and iterators instead of large lists of items. I'm very surprised you say you're having connection timeouts because fetching all the rows should not take much more time than fetching a row at a time and performing the simple processing you are doing. Perhaps we should have a look at your db-fetching code?

If row-at-a-time processing is really not a possibility, then at least keep your data as an immutable deque and perform all processing on it with generators and iterators.

I'll outline these different approaches.

First of all, some common functions:

# if you don't need random-access to elements in a sequence
# a deque uses less memory and has faster appends and deletes
# from both the front and the back.
from collections import deque
from itertools import izip, repeat, islice, chain
import re

# placeholder pattern; substitute the characters you actually need to escape for Redshift
re_redshift_chars = re.compile(r'[abcdefg]')

def istrjoin(sep, seq):
    """Return a generator that acts like sep.join(seq), but lazily

    The separator will be yielded separately
    """
    return islice(chain.from_iterable(izip(repeat(sep), seq)), 1, None)

def escape_redshift(s):
    return re_redshift_chars.sub(r'\\\g<0>', s)

def tabulate(row):
    return "\t".join(escape_redshift(str(v)) if v is not None else '' for v in row)

Now the ideal is row-at-a-time processing, like this:

cursor = db.cursor()
cursor.execute("""SELECT * FROM bigtable""")
# iterate the cursor row by row instead of materializing everything with fetchall()
# (with MySQLdb, a server-side cursor such as SSCursor is needed for true streaming)
rowstrings = (tabulate(row) for row in cursor)
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
cursor.close()

This will take the least possible amount of memory--only a row at a time.

If you really need to store the entire resultset, you can modify the code slightly:

cursor = db.cursor()
cursor.execute("SELECT * FROM bigtable")
collection = deque(cursor.fetchall())
cursor.close()
rowstrings = (tabulate(row) for row in collection)
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)

Now we gather all results into collection first, which remains entirely in memory for the entire program run.

However we can also duplicate your approach of deleting collection items as they are used. We can keep the same "code shape" by creating a generator that empties its source collection as it works. It would look something like this:

def drain(coll):
    """Return an iterable that deletes items from coll as it yields them.

    coll must support `coll.popleft()`, `coll.pop(0)`, or `del coll[0]`.
    A deque is recommended!
    """
    if hasattr(coll, 'popleft'):
        # deque: popping from the left is O(1)
        def pop(coll):
            try:
                return coll.popleft()
            except IndexError:
                raise StopIteration
    elif hasattr(coll, 'pop'):
        # list: pop(0) works, but is O(n) per call
        def pop(coll):
            try:
                return coll.pop(0)
            except IndexError:
                raise StopIteration
    else:
        def pop(coll):
            try:
                item = coll[0]
            except IndexError:
                raise StopIteration
            del coll[0]
            return item
    while True:
        yield pop(coll)

Now you can easily substitute drain(collection) for collection when you want to free up memory as you go. After drain(collection) is exhausted, the collection object will be empty.
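
For example, plugging drain() into the deque-based version above (same hypothetical db and file objects as before):

cursor = db.cursor()
cursor.execute("SELECT * FROM bigtable")
collection = deque(cursor.fetchall())
cursor.close()
rowstrings = (tabulate(row) for row in drain(collection))
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
# collection is now empty; each tuple was dropped as soon as its line was written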

Francis Avila

If your algorithm depends on popping from the left side or beginning of a list, you can use a deque object from collections as a faster alternative.

As a comparison:

import timeit

f1='''
q=deque()
for i in range(40000):
    q.append((i,i,'tuple {}'.format(i)))

while q:
    q.popleft()
'''

f2='''
l=[]
for i in range(40000):
    l.append((i,i,'tuple {}'.format(i)))

while l:
    l.pop(0)
'''

print 'deque took {:.2f} seconds to popleft()'.format(timeit.timeit(stmt=f1, setup='from collections import deque',number=100))
print 'list took {:.2f} seconds to pop(0)'.format(timeit.timeit(stmt=f2,number=100))

Prints:

deque took 3.46 seconds to popleft()
list took 37.37 seconds to pop(0)

So for this particular test of popping from the beginning of the list or queue, deque is more than 10x faster.

This large advantage only applies to the left side, however. If you run this same test with pop() on both, the speed is roughly the same. You can also reverse the list in place and pop from the right side to get the same result as popleft on the deque.


In terms of 'efficiency', it will be far more efficient to process single rows from the database. If that is not an option, process your list (or deque) 'collection' in place.

Try something along these lines.

First, break out the row processing:

def process_row(row):
    # I did not test this obviously, but I think I translated your row processing faithfully
    new_row = []
    for value in row:
        if value is not None:
            in_chars = str(value)
        else:
            in_chars = ''
        new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
    return '\t'.join(new_row)

Now look at using a deque to allow fast pops from the left:

def cgen(collection):
    # if collection is a deque:
    while collection:
        yield '\n'+process_row(collection.popleft())

Or if you want to stick to a list:

def cgen(collection):
    collection.reverse()
    while collection:
        yield '\n'+process_row(collection.pop())
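
Either way, cgen can then be consumed lazily, for example (using a hypothetical output file) like this:

# stream the processed rows straight to a file so the full
# concatenated string never has to exist in memory
with open('output.tsv', 'w') as out:
    out.writelines(cgen(collection))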

I think that your original approach of calling pop(0), processing the row, and calling gc every 5000 rows is probably suboptimal. The gc will be called automatically far more often than that anyway.
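
For what it's worth, you can see how often the automatic collector is triggered by inspecting its thresholds:

import gc
print gc.get_threshold()  # typically (700, 10, 10): a gen-0 pass after ~700 net allocations
print gc.get_count()      # current allocation counts per generation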

My final recommendation:

  1. Use a deque. It is just like a list, but faster for left-side pushes and pops;
  2. Use popleft() so you do not need to reverse the list (if the order of collection is meaningful);
  3. Process your collection in place as a generator;
  4. Throw out the notion of calling gc since it is not doing anything for you.
  5. Throw out 1-4 here if you can just call the db and get 1 row and process 1 row at a time!
dawg