
I'm using IPython.parallel to process a large amount of data on a cluster. The remote function I run looks like:

def evalPoint(point, theta):
    # do some complex calculation
    return (cost, grad)

which is invoked by this function:

def eval(theta, client, lview, data):
    async_results = []
    for point in data:
        # evaluate current data point
        ar = lview.apply_async(evalPoint, point, theta)
        async_results.append(ar)

    # wait for all results to come back
    client.wait(async_results)

    # and retrieve their values
    values = [ar.get() for ar in async_results]

    # unzip data from original tuple
    totalCost, totalGrad = zip(*values)

    avgGrad = np.mean(totalGrad, axis=0)
    avgCost = np.mean(totalCost, axis=0)

    return (avgCost, avgGrad)

If I run the code:

client = Client(profile="ssh")
client[:].execute("import numpy as np")        

lview = client.load_balanced_view()

for i in xrange(100):
    eval(theta, client, lview, data)

the memory usage keeps growing until I eventually exhaust the machine's 76GB. I've simplified evalPoint to do nothing in order to make sure it wasn't the culprit.

The first part of eval was copied from IPython's documentation on how to use the load balancer. The second part (unzipping and averaging) is fairly straightforward, so I don't think it's responsible for the memory leak. Additionally, I've tried manually deleting objects in eval and calling gc.collect(), with no luck.
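For reference, the unzip-and-average step in isolation behaves like this on toy data (same pattern as in eval above, just with hard-coded values):

```python
import numpy as np

# Each task returns a (cost, grad) tuple; zip(*...) separates them.
values = [(1.0, np.array([1.0, 2.0])),
          (3.0, np.array([3.0, 4.0]))]

totalCost, totalGrad = zip(*values)   # (1.0, 3.0) and the two grad arrays
avgCost = np.mean(totalCost, axis=0)  # 2.0
avgGrad = np.mean(totalGrad, axis=0)  # array([2., 3.])
```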

I was hoping someone with IPython.parallel experience could point out something obvious I'm doing wrong, or confirm that this is in fact a memory leak.

Some additional facts:

  • I'm using Python 2.7.2 on Ubuntu 11.10
  • I'm using IPython version 0.12
  • I have engines running on servers 1-3, and the client and hub running on server 1. I get similar results if I keep everything on just server 1.
  • The only report I've found of a similar memory leak in IPython involved %run, which I believe was fixed in this version (and, in any case, I am not using %run)

Update

I also tried switching the controller's task storage from the in-memory default to SQLiteDB, in case that was the problem, but the memory still grows.
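For anyone else trying this: the backend is selected in the controller's config file. The class paths below reflect my understanding of the 0.12 layout, so double-check them against your install:

```python
# ipcontroller_config.py (sketch; class paths per IPython 0.12)
c = get_config()

# The default is the in-memory DictDB; switch to SQLite:
c.HubFactory.db_class = "IPython.parallel.controller.sqlitedb.SQLiteDB"

# or, if a mongod server is available:
# c.HubFactory.db_class = "IPython.parallel.controller.mongodb.MongoDB"
```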

Response (1)

The memory consumption is definitely in the controller (I verified this by (a) running the client on another machine and (b) watching top). I hadn't realized that backends other than SQLiteDB would still hold results in memory, so I hadn't bothered purging.

If I use DictDB and purge, I still see the memory consumption go up, but at a much slower rate. It was hovering around 2GB for 20 invocations of eval().

If I use MongoDB and purge, it looks like mongod is taking around 4.5GB of memory and ipcluster about 2.5GB.

If I use SQLite and try to purge, I get the following error:

File "/usr/local/lib/python2.7/dist-packages/IPython/parallel/controller/hub.py", line 1076, in purge_results
  self.db.drop_matching_records(dict(completed={'$ne':None}))
File "/usr/local/lib/python2.7/dist-packages/IPython/parallel/controller/sqlitedb.py", line 359, in drop_matching_records
  expr,args = self._render_expression(check)
File "/usr/local/lib/python2.7/dist-packages/IPython/parallel/controller/sqlitedb.py", line 296, in _render_expression
  expr = "%s %s"%null_operators[op]
TypeError: not enough arguments for format string

So, I think if I use DictDB, I might be okay (I'm going to try a full run tonight). I'm not sure whether some memory consumption is still expected (I also purge in the client, like you suggested).

Abe Schneider

1 Answer


Is it the controller process that is growing, or the client, or both?

The controller remembers all requests and all results, so the default behavior of storing this information in a simple dict will result in constant growth. Using a db backend (sqlite or preferably mongodb if available) should address this, or the client.purge_results() method can be used to instruct the controller to discard any/all of the result history (this will delete them from the db if you are using one).

The client itself caches all of its own results in its results dict, so this, too, will result in growth over time. Unfortunately, this one is a bit harder to get a handle on, because references can propagate in all sorts of directions, and it is not affected by the controller's db backend.

This is a known issue in IPython, but for now, you should be able to clear the references manually by deleting the entries in the client's results/metadata dicts; and if your view is sticking around, it also has its own results dict:

# ...
# and retrieve their values
values = [ar.get() for ar in async_results]

# clear references to the local cache of results:
for ar in async_results:
    for msg_id in ar.msg_ids:
        del lview.results[msg_id]
        del client.results[msg_id]
        del client.metadata[msg_id]

Or, you can purge the entire client-side cache with a simple dict.clear():

lview.results.clear()
client.results.clear()
client.metadata.clear()

Side note:

Views have their own wait() method, so you shouldn't need to pass the Client to your function at all. Everything should be accessible via the View, and if you really need the client (e.g. for purging the cache), you can get it as view.client.
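Putting that together, your eval() could take only the view. This is a sketch, not your exact code: I've renamed it eval_points (to avoid shadowing the builtin eval) and passed evalPoint in explicitly as fn; the cache clearing is the manual workaround described above:

```python
import numpy as np

def eval_points(view, fn, theta, data):
    # Submit one task per data point through the load-balanced view.
    async_results = [view.apply_async(fn, point, theta) for point in data]

    # Views have their own wait(); the Client is reachable as view.client.
    view.wait(async_results)
    values = [ar.get() for ar in async_results]

    # Manually drop the client-side caches so they don't grow without bound.
    view.results.clear()
    view.client.results.clear()
    view.client.metadata.clear()

    # Unzip the (cost, grad) tuples and average each component.
    totalCost, totalGrad = zip(*values)
    return np.mean(totalCost, axis=0), np.mean(totalGrad, axis=0)
```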

minrk
  • Thank you for taking time to respond! I don't have enough space to properly respond to everything, so I'm instead putting the rest of the response at the bottom of my question. – Abe Schneider Jan 12 '12 at 22:47
  • I was able to do a full-run last night, so that fixed my problem as long as I used DictDB. I'm guessing there might be an option for MongoDB to limit its memory usage, but that's a separate issue. – Abe Schneider Jan 13 '12 at 15:38
  • Meant to add: one interesting thing is that DictDB runs much faster for my processes than MongoDB or SQLite. I'm guessing that's due to the speed at which processes finish and the overhead of bookkeeping that entails. – Abe Schneider Jan 13 '12 at 15:39
  • Thanks, I fixed the sqlite typo. Part of the point of the db backends is that purge_results shouldn't be necessary. If you do still use it, I would recommend that you do so as infrequently as you can. Part of the advantage of the new design is that a slow Hub *cannot* slow down execution, but that's only true if you aren't blocking on Hub operations all the time. Also, see [this post on mongodb memory usage](http://blog.mongodb.org/post/101911655/mongo-db-memory-usage) and [the relevant doc](http://www.mongodb.org/display/DOCS/Checking+Server+Memory+Usage). – minrk Jan 13 '12 at 19:59
  • At least with the SQLiteDB, if I don't purge, it takes up all my memory. Thanks for the info of MongoDB. So, does that mean in theory I should be able to run Mongo without purging? – Abe Schneider Jan 13 '12 at 22:11
  • Hm, the sqlite is probably a speed issue. The SQLite backend is *slow*. I just ran a test, and submitting jobs very quickly it does grow, but during idle time it shrinks back down. For instance, my Hub right now is sitting at 60MB resident, after processing 4GB of requests, though it did peak at 933MB when it was furthest behind. I never called purge. – minrk Jan 14 '12 at 00:00
  • That's interesting. My processes don't really have idle time, so I'm guessing that's the reason for the large growth. I call purge at the end of every pass, so I think it will be okay. However, I suspect that to get better efficiency I should start batching the jobs to each client. – Abe Schneider Jan 17 '12 at 15:35
  • I [recently added](https://github.com/ipython/ipython/pull/1267) a '--nodb' flag to the controller, which disables the Hub's logging of results, so if you use that you shouldn't ever have to call purge_results, as there will never be any results to purge. – minrk Jan 18 '12 at 20:05
  • Hi Abe, Hi Min, i ran into a very similar problem, I was going to post a new question, but thought I might check with you. I'm not waiting for the results, but instead am running parallel async evaluations and getting them as they come by looping through the results with a small timeout repeatedly. I'm getting the same memory leak problem. I've tried purge_results('all') and weirdly it increases the view's size (as measured by asizeof). results.clear() doesn't work as it messes up my waiting async calls. Any thoughts? Maybe I'll write a new question... – Alex S Sep 25 '13 at 16:14