I am trying to optimize the memory consumption of some data processing that I have to do on some large .csv files. I started using sys.getsizeof() to check whether it would be worthwhile to convert dataframes to lists or vice versa (my data usually contains just numbers). Then I noticed this: my list of 47 Pandas DataFrames occupies just 0.00052 MB, yet each individual DataFrame is larger than the whole list! Here is the code I am using:


    import glob
    import os
    import sys

    import pandas as pd

    path = 'D:\\folder_of_folders'
    os.chdir(path)
    fileList = glob.glob(path + '\\*')  # load all folders into a list
    dfList = []  # list that will hold the dataframes
    final_df_columns = ['column_name1', 'column_name2']  # names of the columns to keep
    for file in fileList:
        try:
            # read the CSV in this folder, keeping only the wanted columns
            df = pd.read_csv(file + '\\filename.csv', usecols=final_df_columns)
            dfList.append(df)  # append the dataframe to dfList (a list of dataframes)
        except FileNotFoundError:
            print("File " + file + " not found")
    print(str(sys.getsizeof(dfList) / 1000000) + ' MB')
    print(str(sys.getsizeof(dfList[0]) / 1000000) + ' MB')
    print(str(sys.getsizeof(dfList[1]) / 1000000) + ' MB')
    print(str(sys.getsizeof(dfList[46]) / 1000000) + ' MB')

The output is: 0.00052 MB, 4.349776 MB, 4.356944 MB, 0.313456 MB.

How is this possible? Clearly the actual memory occupied by dfList is not 0.00052 MB. How do I find the correct value? Thank you!

OK, recursive size is what I needed; it seems obvious now. For now I am using the code from here: https://code.activestate.com/recipes/577504/ (it seems to work fine).

CristianG
  • From the documentation: `Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.` So when applied to the list, it returns the size of the list itself, without the contents. Try [pympler.asizeof](https://pympler.readthedocs.io/en/stable/library/asizeof.html#pympler.asizeof.asizeof) – Nullman May 17 '22 at 11:39

1 Answer

sys.getsizeof is not deep/recursive, so you only get the size of the list object itself, not of its contents.
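To make the difference visible, here is a minimal sketch using only the standard library. The helper name `list_size_with_items` is made up for this illustration, and the byte strings are stand-ins for DataFrames. It goes one level deep only, so it is a rough estimate that assumes sys.getsizeof gives a sensible figure for each item (as the roughly 4.35 MB per DataFrame in the question suggests) and that no item appears twice in the list.

```python
import sys


def list_size_with_items(items):
    """Hypothetical helper: size of the list object plus one level of its items."""
    return sys.getsizeof(items) + sum(sys.getsizeof(item) for item in items)


# Stand-ins for DataFrames: three byte strings of about 1 MB each.
blobs = [bytes(1_000_000) for _ in range(3)]

shallow = sys.getsizeof(blobs)       # only the list's own array of references
total = list_size_with_items(blobs)  # the list plus its three items

print(str(shallow / 1000000) + ' MB (list only)')
print(str(total / 1000000) + ' MB (list plus items)')
```

The first figure stays tiny however large the items are, because the list only stores references to them.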

As answered here: Deep version of sys.getsizeof

The pympler library (https://pympler.readthedocs.io/en/latest/library/asizeof.html) might help:

    from pympler.asizeof import asizeof

    # [your code here]
    print(str(asizeof(dfList) / 1000000) + ' MB')

See this answer for more options: How do I measure the memory usage of an object in python?
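If you would rather not add a dependency, one of those options is a small recursive walk. The following is only a sketch of the idea, not the ActiveState recipe or pympler's implementation: `deep_getsizeof` is a hypothetical name, it only descends into the built-in containers listed in the code, and for anything else (a DataFrame, for instance) it relies on whatever sys.getsizeof reports for that object.

```python
import sys


def deep_getsizeof(obj, seen=None):
    """Hypothetical helper: approximate recursive size of built-in containers."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0  # already counted; shared objects are counted once
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size


nested = [[1.5, 2.5], {"a": "x" * 1000}]
pair = [nested, nested]  # the same object referenced twice

print(sys.getsizeof(nested), 'bytes (shallow)')
print(deep_getsizeof(nested), 'bytes (deep)')
```

Tracking ids in `seen` keeps an object that is referenced several times from being counted more than once, and also guards against infinite recursion on self-referencing containers.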

Cristian Dumitru