
Problem

I have a Python application running inside a Docker container. The application receives "jobs" from a queue service (RabbitMQ), does some computing tasks and uploads the results into databases (MySQL and Redis).

The issue I face is that the RAM is not properly "cleaned up" between iterations, so memory consumption grows from iteration to iteration until OOM. Since I have implemented a MemoryError handler (see the tested solutions below for more info), the container stays alive but the memory stays exhausted (it is not freed up by a container restart).


Question

  • How to debug what is "staying" in the memory so I can clean it up? (a debugging sketch follows this list)
  • How to clean up the memory properly between runs?
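
For the debugging part, the standard-library tracemalloc module can show which allocations grew between two snapshots. A minimal sketch, assuming one job is wrapped in a hypothetical run_iteration() function:

import tracemalloc

tracemalloc.start()

snapshot_before = tracemalloc.take_snapshot()
run_iteration()  # hypothetical: process one job from the queue
snapshot_after = tracemalloc.take_snapshot()

# Print the 10 source lines whose allocations grew the most between snapshots
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:10]:
    print(stat)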

Iteration description

An example of increasing memory utilisation (memory limit set to 3000 MiB):

  • fresh container: 130 MiB
  • 1st iteration: 1000 MiB
  • 2nd iteration: 1500 MiB
  • 3rd iteration: 1750 MiB
  • 4th iteration: OOM

Note: Every run/iteration is a bit different and thus has a bit different memory requirements, but the pattern stays similar.


Below is a brief overview of one iteration, which might be helpful in determining what might be wrong; a rough skeleton of such an iteration follows the list.

  1. Receiving job parameters from rabbitmq
  2. Loading data from local parquet into dataframe (using read_parquet(filename, engine="fastparquet"))
  3. Computing values using Pandas functions and other libraries (most of the load is probably here)
  4. Converting dataframe to dictionary and computing some other values inside a loop
  5. Adding some more metrics from computed values - e.g. highest/lowest values, trends etc.
  6. Storing metrics from 5. in database (MySQL and Redis)
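
A rough skeleton of such an iteration with explicit clean-up points added; compute_metrics and store_metrics are hypothetical placeholders, not the actual functions:

import gc

import pandas as pd


def run_iteration(job):
    # 2. Load data from a local parquet file into a dataframe
    df = pd.read_parquet(job["filename"], engine="fastparquet")

    # 3.-4. Compute values and convert the dataframe to a dictionary
    metrics = compute_metrics(df)  # hypothetical placeholder

    # The dataframe is no longer needed once the metrics are extracted
    del df
    gc.collect()

    # 5.-6. Add derived metrics (highs/lows, trends) and store them
    store_metrics(metrics)  # hypothetical placeholder
    del metrics
    gc.collect()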

A selection of the tech I use

  • Python 3.10
  • Pandas 1.4.4
  • numpy 1.24.2
  • running in AWS ECS Fargate (but results on local are similar); 1 vCPU and 8 GB of memory

Possible solutions / tried approaches

  • ❌: tried; did not work
  • 💡: an idea I am going to test
  • ⚠️: did not completely solve the problem, but helped towards the solution
  • ✅: working solution

❌ Restart container after every iteration

The most obvious one is to restart the Docker container (e.g. by calling exit() and letting the container restart itself) after every iteration. This solution is not feasible, because the restart overhead is too big (one run takes 15-60 seconds, so restarting would slow things down too much).

❌ Using gc.collect()

I have tried to call gc.collect() at the very beginning of each iteration, but the memory usage did not change at all.
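
For reference, a minimal sketch of this attempt (assuming the call sits at the top of the per-job callback); gc.collect() returns the number of unreachable objects it found, which is worth logging. logger here is the same local helper used in the psutil snippet further down:

import gc


def callback(body):
    # Force a full collection before the new job starts
    unreachable = gc.collect()
    logger.info(f"gc.collect() reported {unreachable} unreachable objects")
    ...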

✅ Test multiprocessing

I read some recommendations to use the multiprocessing module in order to improve memory efficiency, because it "drops" all resources once the subprocess finishes.

This solved the issue; see the answer below.

https://stackoverflow.com/a/1316799/12193952

Use explicit del on unwanted objects

The idea is to explicitly delete objects that are no longer used (e.g. the dataframe after it has been converted to a dictionary).

del my_array
del my_object

https://stackoverflow.com/a/1316793/12193952
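
A slightly more concrete (hypothetical) variant for this workload: dropping the dataframe right after it has been converted to a dictionary and triggering a collection immediately afterwards:

import gc

records = df.to_dict("records")  # step 4: dataframe -> dictionary
del df                           # drop the last reference to the dataframe
gc.collect()                     # ask the GC to reclaim it right away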

Monitor memory using psutil

import psutil
# Local imports
from utils import logger


def get_usage():
    total = round(psutil.virtual_memory().total / 1000 / 1000, 4)
    used = round(psutil.virtual_memory().used / 1000 / 1000, 4)
    pct = round(used / total * 100, 1)
    logger.info(f"Current memory usage is: {used} / {total} MB ({pct} %)")

    return True
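
One caveat: psutil.virtual_memory() reads system-wide figures, which inside a container typically reflect the host rather than the cgroup limit. A variant that tracks the process's own RSS instead can be more telling (reusing the imports and logger from the snippet above):

def get_process_usage():
    # Resident set size of the current process, converted to MB
    rss = psutil.Process().memory_info().rss / 1000 / 1000
    logger.info(f"Current process RSS: {round(rss, 1)} MB")

    return True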

Support except MemoryError

Thanks to this question I was able to set up a try/except pattern that catches OOM errors and keeps the container running (so logs stay available, etc.).
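
A minimal sketch of such a handler; run_test stands for one full iteration (the same name as in the answer below) and logger is the local helper from the psutil snippet:

def callback(body):
    try:
        run_test(body)
    except MemoryError:
        # The iteration is aborted, but the container keeps running,
        # so logs and monitoring output remain available
        logger.error("Iteration failed with MemoryError (out of memory)")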


Even if I don't get any answer, I will continue testing and editing until I find a solution and hopefully help someone else.

  • You didn't show the code which is doing the computations and/or saving to the DB. In general, in languages with GC it's not recommended to try to free the memory on your own, since there are a number of GC policies and there is no guarantee when stuff is really removed from memory. To have such control, using a language with explicit memory management would make sense IMHO (like C++). It seems some data structure is in the wrong namespace and thus is not being handled correctly / is inside a closure maybe and gets bloated with each job. – Gameplay Mar 10 '23 at 11:52
  • Unfortunately there are 20k+ lines of code doing the computation and other tasks, so I was not sure which parts might be useful. I also agree that it's difficult to guess what is wrong without seeing the code. I can share snippets based on your suggestions. So you suggest rewriting the app in some language with better memory control? Can you elaborate on what you mean by "wrong namespace"? Thanks – FN_ Mar 10 '23 at 11:59

1 Answer


It seems like implementing multiprocessing solved the issue.

Below is a code snippet showing the implementation; it is very simple.

import multiprocessing


def callback():
    ...
    # Run the strategy test
    p = multiprocessing.Process(target=run_test, args=(body,))
    p.start()
    p.join()

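One detail worth adding (my assumption, not something verified in this setup): if the child process is killed by the kernel OOM killer, the parent never sees a MemoryError; it only sees a non-zero exit code. Checking p.exitcode after p.join() makes such cases visible:

    if p.exitcode != 0:
        # Negative values mean the child was terminated by a signal,
        # e.g. -9 when the kernel OOM killer sends SIGKILL
        logger.error(f"run_test subprocess failed with exit code {p.exitcode}")
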
I was able to reduce the share of tests failing due to OOM from 86 % to 0 %. Local testing results are as follows:

  • fresh container: 152 MiB
  • 1st iteration: 162 MiB
  • 2nd iteration: 370 MiB
  • 3rd iteration: 371 MiB
  • 4th iteration: 371 MiB
  • 5th iteration: 371 MiB
  • 6th iteration: 371 MiB
  • 7th iteration: 371 MiB
  • 8th iteration: 371 MiB