
The goal is to calculate the RMSE between two groups of columns in a pandas DataFrame. The problem is that the memory actually used is almost 10x the size of the DataFrame. Here is the code I used to calculate RMSE:

import pandas as pd
import numpy as np
from random import shuffle

# set up test df (actual data is a pre-computed DF stored in HDF5)
dim_x, dim_y = 50, 1000000  # actual dataset dim_y = 56410949
cols = ["a_"+str(i) for i in range(1,(dim_x//2)+1)]
cols_b = ["b_"+str(i) for i in range(1,(dim_x//2)+1)]
cols.extend(cols_b)
df = pd.DataFrame(np.random.uniform(0,10,[dim_y, dim_x]), columns=cols)  


# calculate rmse : https://stackoverflow.com/a/46349518
a = df.values
diffs = a[:,1:26] - a[:,26:27]
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)

df['rmse_out'] = rmse_out
df.to_pickle('results_rmse.p')

When I get the values from the df with `a = df.values`, the memory usage for that routine approaches 100GB according to `top`. The line that calculates the difference between the column groups, `diffs = a[:,1:26] - a[:,26:27]`, pushes usage toward 120GB and then raises a MemoryError. How can I modify my code to make it more memory-efficient, avoid the error, and actually calculate my RMSE values?
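For scale, a back-of-envelope on the float64 sizes involved (a sketch using the dimensions quoted above; the observed 100GB presumably also includes temporaries from the HDF5 read and pandas internals):

```python
import numpy as np

# hypothetical sizes matching the dimensions quoted in the question
rows, cols = 56410949, 50
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per float64

df_gb = rows * cols * itemsize / 1e9    # the DataFrame's own float block
diffs_gb = rows * 25 * itemsize / 1e9   # the 25-column `diffs` temporary

print(round(df_gb, 1), round(diffs_gb, 1))  # 22.6 11.3
```

So the DataFrame alone is ~22.6 GB, and each full-length 25-column temporary adds another ~11.3 GB on top of it.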

Divakar
Thomas Matthew
  • One way would be to slice a number of rows and do it iteratively. Also, please add link from where you have picked up that code, because I remember seeing that code :) – Divakar Sep 26 '17 at 19:17
  • Cited you @Divakar – Thomas Matthew Sep 26 '17 at 19:18
  • 2
    How about : `s0,s1 = a[:,1:26] , a[:,26:27]; rmse_out = np.sqrt(ne.evaluate("sum((s0-s1)**2,1)")/3.0)`, where `ne` is external module : `import numexpr as ne`? – Divakar Sep 26 '17 at 19:34

1 Answer


The solution I used was to chunk the dataframe down:

import glob

import numpy as np
import pandas as pd

df = pd.read_hdf('madre_merge_sort32.h5')
for i, d in enumerate(np.array_split(df, 10)):
    d.to_pickle(str(i) + ".p")
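A quick illustration of how `np.array_split` divides a DataFrame's rows (toy data; any uneven remainder goes to the earlier chunks):

```python
import numpy as np
import pandas as pd

# toy stand-in for the real DataFrame: 10 rows split into 3 chunks
df_small = pd.DataFrame({'a_1': range(10), 'b_1': range(10)})
chunks = np.array_split(df_small, 3)
print([len(c) for c in chunks])  # [4, 3, 3]
```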

Then I ran through those pickled mini-dfs and calculated rmse in each:

for fn in glob.glob("*.p"):
    # process df values
    df = pd.read_pickle(fn)
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df.dropna(inplace=True)
    a = df[df.columns[2:]].to_numpy()  # first two cols are non-numeric, so skip

    # calculate rmse (divide by the number of column pairs, not a hard-coded 3.0)
    diffs = a[:, :25] - a[:, 25:]
    rmse_out = np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / diffs.shape[1])

    df['rmse_out'] = rmse_out
    df.to_pickle("out" + fn)

Then I concatenated them:

dfls = []
for fn in glob.glob("out*.p"):
    df = pd.read_pickle(fn)
    dfls.append(df)
dfcat = pd.concat(dfls)

Chunking seemed to work for me.
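The same idea can also be done without the intermediate pickle files, by iterating over row slices with `.iloc` and concatenating the results in memory (a minimal sketch on synthetic data; the column layout, chunk size, and the divisor of 25 — one per column pair — are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# synthetic stand-in: 25 "a_*" columns and 25 "b_*" columns
rng = np.random.default_rng(0)
names = [f"a_{i}" for i in range(1, 26)] + [f"b_{i}" for i in range(1, 26)]
df = pd.DataFrame(rng.uniform(0, 10, (1000, 50)), columns=names)

chunk = 100  # tune so one chunk's temporaries fit in RAM
parts = []
for start in range(0, len(df), chunk):
    d = df.iloc[start:start + chunk].copy()
    a = d.to_numpy()
    diffs = a[:, :25] - a[:, 25:]  # a_i - b_i, row-wise
    d['rmse_out'] = np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / 25.0)
    parts.append(d)

out = pd.concat(parts)
print(out.shape)  # (1000, 51)
```

Only one chunk's `diffs` temporary is alive at a time, so peak memory scales with the chunk size rather than the full row count.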

Thomas Matthew