The goal is to calculate the RMSE between two groups of columns in a pandas DataFrame. The problem is that the memory actually used is almost 10x the size of the DataFrame itself. Here is the code I used to calculate the RMSE:
import pandas as pd
import numpy as np
# set up test df (actual data is a pre-computed DF stored in HDF5)
dim_x, dim_y = 50, 1000000 # actual dataset dim_y = 56410949
cols = ["a_"+str(i) for i in range(1,(dim_x//2)+1)]
cols_b = ["b_"+str(i) for i in range(1,(dim_x//2)+1)]
cols.extend(cols_b)
df = pd.DataFrame(np.random.uniform(0,10,[dim_y, dim_x]), columns=cols)
# calculate rmse : https://stackoverflow.com/a/46349518
a = df.values
diffs = a[:, :25] - a[:, 25:]  # a_1..a_25 minus b_1..b_25, element-wise per row
rmse_out = np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / 25.0)  # per-row mean of squared diffs, then sqrt
df['rmse_out'] = rmse_out
df['rmse_out'].to_pickle('results_rmse.p')
When I pull the values out of the df with a = df.values, the memory usage approaches 100GB according to top. The line that calculates the difference between the two column groups, diffs = a[:, :25] - a[:, 25:], pushes it to around 120GB and then raises a MemoryError. How can I modify my code to make it more memory-efficient, avoid the error, and actually calculate my RMSE values?
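One approach that should bound the peak memory, sketched under a few assumptions: it reuses the column-group names defined above (cols[:25] for the a_ group, cols_b for the b_ group), and the block size of 1,000,000 rows is a placeholder to tune to the memory available. Instead of pulling the whole frame out with .values and building a full-width diffs array (for 56M rows the diffs array alone is roughly 11GB, on top of the ~22.5GB frame), it computes the per-row RMSE one block of rows at a time into a preallocated output:
n_pairs = dim_x // 2
a_cols, b_cols = cols[:n_pairs], cols_b   # the two column groups defined above
block = 1000000                           # rows per block; assumed size, tune to available memory
rmse_out = np.empty(len(df))
for start in range(0, len(df), block):
    chunk = df.iloc[start:start + block]  # row slice of the existing frame, not a full-width copy
    diffs = chunk[a_cols].to_numpy() - chunk[b_cols].to_numpy()
    rmse_out[start:start + block] = np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / n_pairs)
pd.Series(rmse_out, name='rmse_out').to_pickle('results_rmse.p')
The temporaries per iteration are then three block-sized arrays (about 600MB per million rows at 25 float64 columns each) rather than several dataset-sized ones. Since the real data already sits in HDF5, the same loop body could instead be fed by pd.read_hdf(..., chunksize=block) or HDFStore.select(key, start=..., stop=...), provided the store was written in table format, so the full 56M-row frame never needs to be in memory at once.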