
I have a Flask app running in a container on EC2. On starting the container, docker stats showed memory usage close to 48 MB. After making the first API call (reading a 2 GB file from S3), the usage rises to 5.72 GB. Even after the API call completes, the usage does not go down.

On each request, the usage goes up by roughly twice the file size, and after a few requests the server starts throwing memory errors.

Also, when running the same Flask app without the container, we do not see any such increase in memory usage.

Output of "docker stats <container_id>" before hitting the API-

Output of "docker stats <container_id>" after hitting the API

Flask app (app.py) contains-

import os
import json
import pandas as pd
import flask

app = flask.Flask(__name__)


@app.route('/uploadData', methods=['POST'])
def test():
    json_input = flask.request.args.to_dict()
    s3_path = json_input['s3_path']
    # reading file directly from s3 - without downloading
    df = pd.read_csv(s3_path)
    print(df.head(5))
    
    #clearing df
    df = None
    return json_input

@app.route('/healthcheck', methods=['GET'])
def HealthCheck():
    return "Success"

if __name__ == '__main__':
    app.run(host="0.0.0.0", port='8898')

The Dockerfile contains-

FROM python:3.7.10

RUN apt-get update -y && apt-get install -y python-dev

# Copy the application code into the image
COPY . /app_abhi
WORKDIR /app_abhi

EXPOSE 8898

RUN pip3 install flask boto3 pandas fsspec s3fs

CMD [ "python","-u", "app.py" ]

I tried reading the file directly from S3 as well as downloading it first and then reading it, but neither approach helped.

Any leads on getting this memory utilization back down to the initial consumption would be a great help!

Pranav Arora
  • I am seeing similar behavior. When my application is running inside a Docker container, its memory behavior is radically different from the same application running outside a container! I observe a 10-20x increase in memory consumption! Did you solve the problem? – Bashir Abdelwahed Aug 19 '22 at 14:13

3 Answers

1

You can try the following possible solutions:

  1. Specify the dtypes of the columns: pandas by default tries to infer the dtype of each column when it builds the dataframe, and some inferred types allocate far more memory than necessary. You can reduce this by passing explicit dtypes for such columns, e.g. np.int8 for small integer columns and np.float16 for float columns (the old pd.np alias is deprecated). Refer to: Pandas/Python memory spike while reading 3.2 GB file

  2. Read the data in chunks: read the file with a fixed chunksize, process each chunk, and then move on to the next one, so the entire dataset is never held in memory at once. Chunked reading can be slower than reading everything in one go, but it is far more memory-efficient (see the sketch after this list).

  3. Try a different library: Dask DataFrame targets situations where pandas is commonly needed but fails due to data size or computation speed, although not every pandas operation is available in Dask. https://docs.dask.org/en/latest/dataframe.html
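A minimal sketch combining suggestions 1 and 2 (explicit dtypes plus chunked reading); the column names, dtypes, and S3 path below are made up and would need to match your actual file:

import numpy as np
import pandas as pd

s3_path = "s3://my-bucket/big_file.csv"   # hypothetical path

# 1. Explicit dtypes: smaller integer/float types shrink each column in memory.
dtypes = {"id": np.int32, "value": np.float32}

# 2. Chunked reading: only `chunksize` rows are held in memory at a time.
total_rows = 0
for chunk in pd.read_csv(s3_path, dtype=dtypes, chunksize=100_000):
    total_rows += len(chunk)   # replace with your real per-chunk processing
print(total_rows)

# 3. Dask reads lazily and works on partitions instead of the whole file:
# import dask.dataframe as dd
# ddf = dd.read_csv(s3_path, dtype=dtypes)
# print(len(ddf))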

0

The memory growth is almost certainly caused by constructing the dataframe.

Setting df = None doesn't return that memory to the operating system, though it does return it to the heap managed within the process. There's an explanation for that in How do I release memory used by a pandas dataframe?
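As a rough illustration of that point (not part of the answer): on a glibc-based image such as python:3.7.10 you can additionally ask the allocator to hand free heap pages back to the OS, though how much the RSS actually drops depends on heap fragmentation. The release_memory helper below is made up for this sketch.

import ctypes
import gc

def release_memory():
    # Free unreachable Python objects first...
    gc.collect()
    # ...then ask glibc's malloc to return free heap pages to the OS.
    # malloc_trim only exists on glibc (Debian-based images); skip elsewhere.
    try:
        ctypes.CDLL("libc.so.6").malloc_trim(0)
    except (OSError, AttributeError):
        pass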

Dave W. Smith
  • Hi Dave, thank you for sharing the information. Going by this explanation, we should observe the same memory spike when running the Flask app on a machine without the container, but that was not the case. I missed mentioning this in the question, my bad. – Pranav Arora Jan 31 '22 at 10:46
  • If you're running this as a CLI but aren't seeing memory growth across invocations, that's certainly interesting. – Dave W. Smith Jan 31 '22 at 22:26
0

I had a similar problem (see the question Google Cloud Run: script requires little memory, yet reaches memory limit).

Finally, I was able to solve it by adding:

import gc
...
gc.collect()
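Applied to the route from the question's app.py, that would look roughly like this (a sketch, not the answerer's exact code):

import gc
import flask
import pandas as pd

app = flask.Flask(__name__)

@app.route('/uploadData', methods=['POST'])
def test():
    json_input = flask.request.args.to_dict()
    df = pd.read_csv(json_input['s3_path'])
    print(df.head(5))
    del df          # drop the only reference to the dataframe
    gc.collect()    # reclaim it inside the process before the response is sent
    return json_input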
Maxwell86