
I'm trying to optimize the memory usage of a Python script. The following function, in particular, uses a lot of memory:

import pandas as pd

def fn_merge(df1, df2):
    # Take the df2 rows with positive adherence and no declared interest,
    # and left-join them against df1's deduplicated metadata columns.
    enriched = (
        df2
        .query("aderencia_disciplina > 0 and interesse == False")
        .astype({"cod_docente": "int64"})
        .merge(
            df1
            .astype({"cod_docente": "int64"})
            .drop(["discod", "interesse", "aderencia_disciplina"], axis=1)
            .drop_duplicates(),
            on=["iunicodempresa", "cod_docente"],
            how="left",
        )
    )
    # Stack df1 on top of the enriched subset of df2.
    return pd.concat([df1, enriched])

`df1` and `df2` are 1.7 MB and 9.9 MB, respectively. The issue is that the function seems to grab a chunk of memory and never "lets it go": if I execute it, say, ~20 times, RAM usage climbs from ~2 GB to 8 GB and never drops. Does anyone know what's happening? I thought all the memory used within the function would be freed after it finished executing. Any help appreciated.
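
For reference, a minimal loop to reproduce the pattern (assuming `psutil` is available for the RSS measurement; this driver is illustrative, not the original script):

import os

import psutil  # assumed available; used only to read the process RSS

def rss_mb():
    # Resident set size of the current process, in MB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

# Call fn_merge repeatedly and watch whether RSS keeps climbing
# even though the return value is discarded each time.
for i in range(20):
    result = fn_merge(df1, df2)
    print(f"call {i + 1}: RSS = {rss_mb():.0f} MB")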

Lfppfs
  • df1 and df2 refer to objects in the calling function and therefore still exist – DarkKnight Sep 27 '22 at 11:57
  • @Vlad `df1` surviving is OK, but `df1.astype().drop().drop_duplicates()` etc. create temporary objects which should cease to exist after execution. – norok2 Sep 27 '22 at 11:58
  • @Vlad I think df1 and df2 are deleted after the function is executed, at least that's what I understand from reading [this post](https://pythonspeed.com/articles/function-calls-prevent-garbage-collection/) – Lfppfs Sep 27 '22 at 12:02
  • @norok2 calling gc.collect() makes no difference at all, I've tried that already – Lfppfs Sep 27 '22 at 12:03
  • @Lfppfs What makes you think they're deleted? They would need to be explicitly deleted (*del*) in the calling function or, at least, go out of scope, which would make them candidates for garbage collection. – DarkKnight Sep 27 '22 at 12:04
  • Have you deleted `df1` and `df2` explicitly? – norok2 Sep 27 '22 at 12:08
  • I don't delete them in the calling function because I have to use them later. But if I keep calling fn_merge repeatedly, the memory increases much more than if I call it just once. It seems to me that pandas is creating some copies within it that it is not able to delete afterwards. – Lfppfs Sep 27 '22 at 12:11
  • @Lfppfs What do you do with the return value from *fn_merge()* ? – DarkKnight Sep 27 '22 at 12:13
  • I use them for many operations; they are the outputs of my script. Btw, when I said above that I think df1 and df2 are deleted, I meant they are deleted locally, after fn_merge() is executed, and that's why I don't understand why calling the function repeatedly increases memory usage. – Lfppfs Sep 27 '22 at 12:15
  • Can you make your example reproducible, and indicate how you are measuring the memory? Also, have you tried a different version of Pandas? – norok2 Sep 27 '22 at 12:20

1 Answer


See the issue below: the memory isn't really leaked by pandas. The glibc allocator holds on to freed memory instead of returning it to the OS, and calling `malloc_trim` from `libc.so.6` forces it to be released.

[Memory leak using pandas dataframe](https://github.com/pandas-dev/pandas/issues/2659)
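
The workaround discussed in that issue is to ask glibc directly to give its freed arenas back to the OS. A minimal sketch (Linux/glibc only; `malloc_trim` is loaded from `libc.so.6` via `ctypes`):

import ctypes
import gc

def trim_memory():
    # Drop Python-level garbage first so glibc has something to trim.
    gc.collect()
    # malloc_trim(0) asks glibc to return freed heap memory to the OS;
    # it returns 1 if any memory was actually released.
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# For example, after each call:
# result = fn_merge(df1, df2)
# trim_memory()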

I used to have this problem when reading 2 GB CSV files, so I stopped using pandas for that step and solved it by reading the file myself with `with open("filename", "r") as f` and a function I wrote.
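
The function I wrote isn't shown here, but this is the general shape of the approach (a sketch using the standard `csv` module):

import csv

def stream_csv(filename):
    # Read a large CSV one row at a time instead of loading it all
    # into a DataFrame; each row is a dict keyed by the header.
    with open(filename, "r", newline="") as f:
        for row in csv.DictReader(f):
            yield row

# Process rows lazily, e.g.:
# for row in stream_csv("big_file.csv"):
#     handle(row)  # handle() is a hypothetical per-row consumer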

Hope this helps.

謝咏辰