I am facing a memory issue with Python and pandas.
The code is quite simple:
for i in range(5):
    df = db_controller.read_from_database(i)
    print(df)
df is a pandas DataFrame read from a database. Each iteration increases the resident memory by roughly 1 GB, and every iteration retrieves exactly the same data from the database. From my point of view, the resident memory should not keep growing, because df is rebound on each new iteration and the previous DataFrame becomes unreachable. The result is that after a few iterations the resident memory climbs to about 12 GB and I get an OutOfMemory error.
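By resident memory I mean the RSS of the Python process. A minimal sketch of how I observe the growth per iteration, using psutil (the measurement lines are only for illustration, the real loop is the one above):

import os
import psutil

proc = psutil.Process(os.getpid())
for i in range(5):
    df = db_controller.read_from_database(i)
    print(df)
    # resident set size (RSS) of this process, in MB
    print("RSS after iteration %d: %d MB" % (i, proc.memory_info().rss // 1024 ** 2))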
I have tried to force the garbage collector:
import gc

for i in range(5):
    df = db_controller.read_from_database(i)
    print(df)
    del df
    gc.collect()
The result is that each time the garbage collector runs, only around 30 MB of resident memory is released, not the roughly 1 GB that freeing the DataFrame should give back.
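My guess is that the freed memory might only be cached inside the C allocator rather than actually leaked, in which case trimming the heap explicitly could hand it back to the OS. This is just a guess; a minimal sketch of what I mean (Linux/glibc only, using ctypes to call malloc_trim):

import ctypes
import gc

gc.collect()
# Ask glibc to return free heap pages to the OS. This only helps if the
# memory is free inside the allocator and not still referenced somewhere.
ctypes.CDLL("libc.so.6").malloc_trim(0)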
Could anyone help me? How can I completely release the df DataFrame after each iteration?
I have also tried removing db_controller and reading directly with pandas:
from pyathenajdbc import connect
import pandas as pd
import gc

import amazon_constants  # project module holding the S3 staging dir and region

for i in range(5):
    query = "select * from events.common_events limit 20000"
    conn = connect(s3_staging_dir=amazon_constants.AMAZON_S3_TABLE_STAGING_DIR,
                   region_name=amazon_constants.AMAZON_REGION)
    df = pd.DataFrame()
    try:
        df = pd.read_sql(query, conn)
    finally:
        conn.close()
    print(df)
    del df
    gc.collect()
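Is isolating each read in a separate process, so that all of its memory is returned to the OS when the process exits, really the only reliable option? Roughly what I mean (sketch only; the child sends back just the row count instead of the whole DataFrame):

import multiprocessing as mp

def read_one(i, queue):
    # hypothetical: same read as above, done in a child process
    df = db_controller.read_from_database(i)
    queue.put(len(df))

for i in range(5):
    queue = mp.Queue()
    p = mp.Process(target=read_one, args=(i, queue))
    p.start()
    print("rows:", queue.get())
    p.join()
    # everything allocated by the child is released when it exits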