
I had been using older versions of Ray and TensorFlow, but recently upgraded to the following, most recent versions on a Linux Ubuntu 20.04 setup.

ray==2.0.0
tensorflow==2.10.0
cuDNN==8.1
CUDA==11.2

While training a single-agent network, RAM utilization keeps climbing until it is exhausted and the training session crashes; see the TensorBoard screenshot of ram_util_percent below. This behavior was not there with earlier versions of ray and tensorflow.

[TensorBoard screenshot: ram_util_percent]

Below are the things I have tried so far (a sketch of how I am passing the first two settings follows the list):

  1. Set the argument reuse_actors = True in ray.tune.run()
  2. Limited object_store_memory to a certain amount, currently 0.25 GB
  3. Following this and this, set the core file size to unlimited and increased the open files limit
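
Roughly how I am passing the first two settings; the "PPO" trainable and CartPole-v1 config below are placeholders, not my actual setup:

import ray
from ray import tune

# Cap the object store at roughly 0.25 GB; ray.init() takes the value in bytes.
ray.init(object_store_memory=int(0.25 * 1024 ** 3))

tune.run(
    "PPO",                                # placeholder trainable
    config={"env": "CartPole-v1"},        # placeholder config
    reuse_actors=True,                    # item 1 above
    stop={"timesteps_total": 1_000_000},
)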

None of these methods have helped so far. As a temporary workaround, I am calling Python's garbage collector to free up unused memory whenever RAM usage reaches 80%. I am not sure this will keep mitigating the issue as I train for more time steps; my guess is that it will not.

import gc
import psutil

# Method on my environment class.
def collectRemoveMemoryGarbage(self, percThre=80.0):
    """
    :param percThre: percentage threshold (float) above which to run collection
    :return: None
    """
    # Run a full garbage-collection pass once system RAM usage crosses the threshold.
    if psutil.virtual_memory().percent >= percThre:
        _ = gc.collect()
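
I am also considering attaching the same check to Tune itself rather than to the environment; this is just a sketch, and MemoryCleanupCallback is a name I made up here:

import gc
import psutil
from ray import tune

class MemoryCleanupCallback(tune.Callback):
    """Run the garbage collector whenever system RAM usage crosses a threshold."""

    def __init__(self, percThre=80.0):
        self.percThre = percThre

    def on_trial_result(self, iteration, trials, trial, result, **info):
        if psutil.virtual_memory().percent >= self.percThre:
            gc.collect()

# Passed to Tune via: ray.tune.run(..., callbacks=[MemoryCleanupCallback(80.0)])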

Does anyone know a better approach to this problem? I know this is a well-discussed problem on the ray GitHub issues page. This might end up being an annoying bug in ray or tensorflow, but I am looking for feedback from others who are well-versed in this area.

troymyname00

1 Answer


Increasing the checkpoint_freq argument in ray.tune.run() helped me reach 5e6 time steps without any crash due to running out of memory; it was previously 10 and is now 50.

It seems that checkpointing less frequently does the trick.
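
For reference, this is roughly where the argument goes; the "PPO" trainable here is just a placeholder for my actual setup:

from ray import tune

tune.run(
    "PPO",                                # placeholder trainable
    checkpoint_freq=50,                   # was 10; checkpoint every 50 training iterations instead
    checkpoint_at_end=True,               # still keep a final checkpoint
    stop={"timesteps_total": 5_000_000},
)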


I will try out a higher number of time steps next.

troymyname00
  • I'm not entirely able to comprehend how increasing checkpoint frequency solves this issue in particular but I will try it out! Thank you – hridayns Sep 17 '22 at 19:51
  • Another important thing to note is that it's not one particular method, but a combination of methods that seems to help complete longer training runs without crashes due to memory shortage. – troymyname00 Sep 20 '22 at 09:52
  • Hello, I tried increasing checkpoint frequency but as you advised, it seems that a number of methods have to be applied for it to work for longer training times. I also tried increasing evaluation_sample_timeout_s to see if that changed anything. Could you list a few that you tried and worked for you? Would really appreciate it. – hridayns Sep 20 '22 at 12:46
  • Hi, sorry for the late response. I was trying out a few different things. One thing I recently tried was calling a function from the "reset" function of my environment class; that function is essentially similar to the one I mentioned above, but slightly modified. I think this was very helpful, since it kept my swap memory clean and kept RAM usage below 80% on the Linux system where I am running the training sessions. Please check out this gist for the function: https://gist.github.com/troymyname/a0128e0c4dd7e09795a839727fcf56f0 – troymyname00 Sep 26 '22 at 09:29
  • One caveat with the method above is that you may have to type in your super user password from time to time, because turning off swap memory is not allowed without it. I am trying to figure out how to automate that. – troymyname00 Sep 26 '22 at 09:53
  • Hello again @troymyname00, I am using that snippet now, but without the swapping part. Hoping that works, since it is supposed to explicitly make Python complete garbage collection whenever RAM usage goes above a certain % - as per my understanding? – hridayns Sep 29 '22 at 12:10