
I am facing a memory issue with Python and pandas.

The code is quite simple:

for i in range(5):
    df = db_controller.read_from_database(i)   
    print(df)

df is a pandas DataFrame read from a database. Each iteration increases the resident memory by ~1 GB, and every iteration retrieves exactly the same data from the database. From my point of view, the resident memory should not grow on each iteration, as the variable df goes out of scope (it is rebound in the next iteration). The result is that after some iterations the resident memory climbs to 12 GB and I get an OutOfMemory error.
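
For what it's worth, here is a minimal sketch for confirming where the growth happens by sampling the resident set size (RSS) after each read; it assumes the third-party psutil package is installed:

import os
import psutil

process = psutil.Process(os.getpid())
for i in range(5):
    df = db_controller.read_from_database(i)
    # Print RSS in MiB after each read to see the per-iteration growth.
    print(i, process.memory_info().rss / 1024 ** 2)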

I have tried to force a garbage collection:

import gc

for i in range(5):
    df = db_controller.read_from_database(i)
    print(df)
    del df
    gc.collect()

The result is that each time the garbage collector is called, around 30 MB is released from the resident memory, but it cannot release the ~1 GB it should.
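
One plausible explanation for this gap is that CPython and glibc's malloc keep freed memory in their own free lists instead of returning it to the OS, so RSS stays high even after the objects are gone. A sketch of a known workaround, assuming a glibc-based Linux system (malloc_trim is glibc-specific):

import ctypes
import gc

libc = ctypes.CDLL("libc.so.6")

for i in range(5):
    df = db_controller.read_from_database(i)
    print(df)
    del df
    gc.collect()
    # Ask glibc to return free heap pages to the OS
    # (returns 1 if memory was actually released, 0 otherwise).
    libc.malloc_trim(0)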

Could anyone help me? How can I completely release the df DataFrame after each iteration?

  • I have also tried removing db_controller:

    from pyathenajdbc import connect
    import pandas as pd
    import gc

    for i in range(5):
        query = "select * from events.common_events limit 20000"

        conn = connect(s3_staging_dir=amazon_constants.AMAZON_S3_TABLE_STAGING_DIR,
                       region_name=amazon_constants.AMAZON_REGION)
        df = pd.DataFrame()
        try:
            df = pd.read_sql(query, conn)
        finally:
            conn.close()

        print(df)
        del df
        gc.collect()
    
bracana
  • I guess it will have no effect on memory, but did you try `df = None` instead of `del df`? – Alperen Sep 20 '17 at 12:08
  • Yes, I have tried this too, but with the same effect. – bracana Sep 20 '17 at 12:28
  • How do you know it's pandas? I'd tend to think it's db_controller not releasing memory but you haven't provided any info on db_controller so it's difficult to say. db_controller must be an instance of something, but what? – JohnE Sep 20 '17 at 12:48
  • Thanks for your help @JohnE. I have removed the call to db_controller; see what I have modified above. Still the same result... – bracana Sep 20 '17 at 13:00

1 Answer


I didn't try, but this should work:

from multiprocessing import Pool

def read_func(i):
    df = db_controller.read_from_database(i)
    print(df)

pool = Pool()
# Each read runs in a worker process, so the memory it
# allocates is freed when the pool is torn down.
pool.map(read_func, range(5))
pool.close()
pool.join()

Because multiprocessing works at the OS level: each read runs in a separate process, and when that process exits its memory is returned to the OS, regardless of what pandas holds on to.
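
A variant of the same idea that keeps the sequential loop shape: maxtasksperchild=1 makes the pool restart the worker after every task, so the memory held by each read dies with its process (a sketch under the same assumptions as the answer):

from multiprocessing import Pool

def read_func(i):
    df = db_controller.read_from_database(i)
    print(df)

# processes=1 keeps execution sequential; maxtasksperchild=1
# gives each task a fresh worker whose memory is returned to
# the OS when the worker exits.
pool = Pool(processes=1, maxtasksperchild=1)
pool.map(read_func, range(5))
pool.close()
pool.join()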

Alperen
  • Thank you for your help @Alperen, but I would like to find a solution that does not imply making my application multi-process. – bracana Sep 20 '17 at 13:04
  • Then I suggest you try [this](https://stackoverflow.com/a/39377643/6900838) and [this](https://stackoverflow.com/a/31888262/6900838) (see the chunked-read sketch after these comments). – Alperen Sep 20 '17 at 13:14
  • @user1666191 Could you just try and tell me if it works or not? I'm curious. – Alperen Sep 21 '17 at 05:38
  • I have tried your multiprocessing solution; it works, but I would really like to find another solution. Thank you very much anyway! – bracana Sep 22 '17 at 08:11
  • You're welcome, I'm glad that it works. I'm not experienced enough in pandas to suggest solutions without trying the code myself, and I don't have your DB, your files, etc., so I can't try. You have all of those. Try the links in my first comment. – Alperen Sep 22 '17 at 08:31
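
For reference, a minimal sketch of a chunked read, which is one common way to bound pandas memory when reading from SQL (whether it matches the linked answers exactly is an assumption). It reuses the connection setup from the question; amazon_constants comes from the asker's code. chunksize makes pd.read_sql return an iterator of DataFrames, so only one chunk is resident at a time:

import pandas as pd
from pyathenajdbc import connect

conn = connect(s3_staging_dir=amazon_constants.AMAZON_S3_TABLE_STAGING_DIR,
               region_name=amazon_constants.AMAZON_REGION)
try:
    query = "select * from events.common_events limit 20000"
    # chunksize yields DataFrames of up to 5000 rows lazily,
    # instead of materializing all 20000 rows at once.
    for chunk in pd.read_sql(query, conn, chunksize=5000):
        print(chunk)
finally:
    conn.close()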