I started using Ray for distributed machine learning and I am already running into issues. Memory usage keeps growing until the program crashes. Although I clear the list constantly, memory is somehow leaking. Any idea why?

My specs:

  • OS Platform and Distribution: Ubuntu 16.04
  • Ray installed from: binary
  • Ray version: 0.6.5
  • Python version: 3.6.8

I already tried using the experimental queue instead of the DataServer class, but the problem is still the same.

import numpy as np
import ray
import time
ray.init(redis_max_memory=100000000)


@ray.remote
class Runner():
    def __init__(self, dataList):
        self.run(dataList)

    def run(self,dataList):
        while True:
            dataList.put.remote(np.ones(10))

@ray.remote
class Optimizer():
    def __init__(self, dataList):
        self.optimize(dataList)

    def optimize(self,dataList):
        while True:
            dataList.pop.remote()

@ray.remote
class DataServer():
    def __init__(self):
        self.dataList = []

    def put(self, data):
        self.dataList.append(data)

    def pop(self):
        if len(self.dataList) != 0:
            return self.dataList.pop()

    def get_size(self):
        return len(self.dataList)


dataServer = DataServer.remote()
runner = Runner.remote(dataServer)
optimizer1 = Optimizer.remote(dataServer)
optimizer2 = Optimizer.remote(dataServer)

while True:
    time.sleep(1)
    print(ray.get(dataServer.get_size.remote()))

After running for some time I get this error message:

TRZUKLO
    I think you forgot to include the error message. Also, what do your print statements print? Is the length of some list growing faster than it is being cleared? Some questions/comments: 1) Can you see which process is using all of the memory (e.g., through `top`). 2) You can also try `ray.init(object_store_memory=10**9)`. However, I suspect it is one of the Python actors that is using more and more memory. I'd suggest looking at the Ray timeline to see if it looks as expected (documentation at https://ray.readthedocs.io/en/latest/user-profiling.html#visualizing-tasks-in-the-ray-timeline). – Robert Nishihara Apr 18 '19 at 19:02

2 Answers

I recently ran into a similar problem and found that if you frequently put large objects into the object store (using ray.put()), you need to do one of the following:

  1. Manually adjust the thresholds that the Python garbage collector uses.

  2. Call gc.collect() on a regular basis.

I implemented a method that checks the amount of used memory and calls the garbage collector when it is too high.

The problem is that the default thresholds are based on the number of objects, not their size, so if you are putting large objects, the gc may not run until you are out of memory. My utility method is as follows:

import gc
import psutil  # third-party package: pip install psutil


def auto_garbage_collect(pct=80.0):
    """
    Run the garbage collector if system memory usage is at or above `pct` percent.

    This works around memory not being freed when large objects are pushed through Ray.

        pct - Percentage of memory in use that triggers the collection. Defaults to 80.
    """
    if psutil.virtual_memory().percent >= pct:
        gc.collect()

Calling this regularly solves the problem when it is caused by pushing large objects via ray.put() and running out of memory.
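The first option (adjusting the collector thresholds) can be sketched with the standard gc module. This is a minimal illustration and not something taken from Ray's documentation: tighten_gc and its factor argument are hypothetical names, and whether a lower threshold helps depends on the retained memory actually being held in reference cycles.

```python
import gc

# Read the current thresholds instead of assuming the defaults.
gen0, gen1, gen2 = gc.get_threshold()


def tighten_gc(factor=10):
    """Hypothetical helper: make generation-0 collections run more often.

    Divides the generation-0 threshold by `factor` (never below 1) and
    leaves the other two thresholds unchanged.
    """
    new_gen0 = max(1, gen0 // factor)
    gc.set_threshold(new_gen0, gen1, gen2)
    return gc.get_threshold()


if __name__ == "__main__":
    print("before:", (gen0, gen1, gen2))
    print("after: ", tighten_gc())
```

A lower threshold trades some CPU time for earlier collections, so it is worth measuring both memory and throughput after changing it.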

Seanny123
Michael Wade
    Thank you for the example. I cannot edit this answer, but you will have to add 'import gc' and 'import psutil' somewhere. – troymyname00 Sep 10 '22 at 11:06

A quick fix is to use:

    ray.shutdown()

I code in Spyder, which displays the percentage of memory used in the bottom right corner. When I ran the same script multiple times, I noticed that the memory percentage increased by about 3% per run (on a machine with 8 GB of RAM). The regular increments made me wonder whether Ray was keeping something like a session alive after each run.

It turns out that it does.

ray.shutdown() ends the session. However, you need to call ray.init() again if you want to run your script again. Also, make sure you place the call in the right location so that you do not shut Ray down while it is still needed.
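One way to make sure the shutdown call always runs, and runs only after the work is done, is to wrap the session in a context manager. The sketch below is an illustration under assumptions: session is a hypothetical helper that takes the start and stop functions as arguments, so it does not import Ray itself, and the Ray usage shown in the comment assumes ray is installed.

```python
from contextlib import contextmanager


@contextmanager
def session(start, stop):
    """Hypothetical helper: call start() on entry and stop() on exit.

    stop() runs even if the body raises, so the session is never
    left open by accident.
    """
    start()
    try:
        yield
    finally:
        stop()


# Intended use with Ray (assumes ray is installed):
#
#     import ray
#     with session(ray.init, ray.shutdown):
#         ...  # create actors, submit tasks, call ray.get
```

Because the stop function is called in a finally clause, re-running the script in the same interpreter starts from a fresh session each time.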

This solves the problem of memory usage increasing when a script is run several times.

I do not know Ray very well, but ray.init() has several arguments related to addresses. I suspect one of them can be used to make Ray reuse the same session. This is speculation; I have not tried it yet. Perhaps you can figure this out?

Seanny123
Dylan Solms
    I suppose turning your laptop off and on again gets you there as well:) These two workarounds are not qualitatively different. – mirekphd Aug 28 '22 at 17:25