I am trying to optimize the memory consumption of some data processing that I have to perform on some large .csv files. I started using sys.getsizeof() to check whether it would be more memory-efficient to convert DataFrames to lists or vice versa (my data usually contains just numbers). Then I noticed something odd: my list of 47 Pandas DataFrames occupies just 0.00052 MB, yet each individual DataFrame is larger than the list itself! Here is the code I am using:
import glob
import os
import sys

import pandas as pd

path = 'D:\\folder_of_folders'
os.chdir(path)
fileList = glob.glob(path + '\\*')  # list all folders inside the root folder
dfList = []  # empty list that will collect one DataFrame per folder
final_df_columns = ['column_name1', 'column_name2']  # names of the columns to keep
for file in fileList:
    try:
        # read only the wanted columns of each CSV into a DataFrame
        df = pd.read_csv(file + '\\filename.csv', usecols=final_df_columns)
        dfList.append(df)  # append the DataFrame to dfList (a list of DataFrames)
    except FileNotFoundError:
        print("File " + file + " not found")

print(str(sys.getsizeof(dfList) / 1000000) + ' MB')
print(str(sys.getsizeof(dfList[0]) / 1000000) + ' MB')
print(str(sys.getsizeof(dfList[1]) / 1000000) + ' MB')
print(str(sys.getsizeof(dfList[46]) / 1000000) + ' MB')
and the output is:

0.00052 MB
4.349776 MB
4.356944 MB
0.313456 MB

How is this possible? Clearly the actual memory occupied by dfList is not 0.00052 MB. How do I find the correct value? Thank you!

Ok, recursive size is what I needed; it seems obvious now. For now I am using the code from here: https://code.activestate.com/recipes/577504/ which seems to work fine.
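As a lighter-weight alternative to the recursive getsizeof recipe, pandas itself can report the deep memory footprint of each DataFrame via DataFrame.memory_usage(deep=True). A minimal sketch of summing those numbers over the list (the helper name total_mb is just an illustration, assuming dfList is built as in the loop above):

import sys

def total_mb(dfs):
    """Rough total memory of a list of DataFrames, in MB.

    sys.getsizeof(dfs) only counts the list object itself (its array of
    pointers to the DataFrames), so the per-DataFrame sizes must be added
    separately. memory_usage(deep=True) also counts the Python objects held
    by object-dtype columns, which a shallow size check would miss.
    """
    container = sys.getsizeof(dfs)  # the list's own overhead only
    frames = sum(df.memory_usage(deep=True).sum() for df in dfs)
    return (container + frames) / 1000000

# Example usage with the list built above:
# print(str(total_mb(dfList)) + ' MB')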