
Given a pandas DataFrame as below:

import pandas as pd
from sklearn.metrics import mean_squared_error

df = pd.DataFrame.from_dict(
    {'row': ['a', 'b', 'c', 'd', 'e', 'y'],
       'a': [  0, -.8, -.6, -.3,  .8, .01],
       'b': [-.8,   0,  .5,  .7, -.9, .01],
       'c': [-.6,  .5,   0,  .3,  .1, .01],
       'd': [-.3,  .7,  .3,   0,  .2, .01],
       'e': [ .8, -.9,  .1,  .2,   0, .01],
       'y': [.01, .01, .01, .01, .01,   0],
    }).set_index('row')
df.columns.names = ['col']

I want to create a new column of RMSE values (from scikit-learn), using specific columns for the arguments: y_true = df[['a','b','c']] and y_pred = df[['d','e','y']]. This was easy to do using an iterative approach:

# tup[0] is the index label; tup[1:4] are columns a,b,c and tup[4:7] are d,e,y
for tup in df.itertuples():
    df.at[tup[0], 'rmse'] = mean_squared_error(tup[1:4], tup[4:7])**0.5

And that gives the desired result:

col     a     b     c     d     e     y      rmse
row                                              
a    0.00 -0.80 -0.60 -0.30  0.80  0.01  1.003677
b   -0.80  0.00  0.50  0.70 -0.90  0.01  1.048825
c   -0.60  0.50  0.00  0.30  0.10  0.01  0.568653
d   -0.30  0.70  0.30  0.00  0.20  0.01  0.375988
e    0.80 -0.90  0.10  0.20  0.00  0.01  0.626658
y    0.01  0.01  0.01  0.01  0.01  0.00  0.005774

But I want a higher-performance solution, possibly using vectorization, since my dataframe has shape (180000000, 52). I also dislike indexing by tuple position rather than by column name. The attempt below:

df['rmse'] = df.apply(mean_squared_error(df[['a','b','c']], df[['d','e','y']])**0.5, axis=1)

gets the error:

TypeError: ("'numpy.float64' object is not callable", 'occurred at index a')

So what am I doing wrong with my use of df.apply()? Does this even maximize performance over iteration?

Testing Performance

I've tested the wall times for each of the first two respondents' approaches using the test df below:

# set up test df
import numpy as np
import pandas as pd
from random import shuffle

dim_x, dim_y = 50, 1000000
cols = ["a_"+str(i) for i in range(1, (dim_x//2)+1)]
cols_b = ["b_"+str(i) for i in range(1, (dim_x//2)+1)]
cols.extend(cols_b)
shuffle(cols)
df = pd.DataFrame(np.random.uniform(0, 10, [dim_y, dim_x]), columns=cols)
a = df.values

# define column samples
def column_index(df, query_cols):
    # return the integer positions of query_cols within df's columns
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

c0 = [s for s in cols if "a" in s]
c1 = [s for s in cols if "b" in s]
s0 = a[:,column_index(df, c0)]
s1 = a[:,column_index(df, c1)]

The results are as follows:

%%time
# approach 1 - divakar
rmse_out = np.sqrt(((s0 - s1)**2).mean(1))
df['rmse_out'] = rmse_out

Wall time: 393 ms

%%time
# approach 2 - divakar
diffs = s0 - s1
# divide by the number of columns (the answer's toy example hard-coded 3)
rmse_out = np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / s0.shape[1])
df['rmse_out'] = rmse_out

Wall time: 228 ms

%%time
# approach 3 - divakar
rmse_out = np.sqrt((np.einsum('ij,ij->i', s0, s0) +
                    np.einsum('ij,ij->i', s1, s1) -
                    2*np.einsum('ij,ij->i', s0, s1)) / s0.shape[1])
df['rmse_out'] = rmse_out

Wall time: 421 ms

The solution using the apply function is still running after several minutes...

Thomas Matthew

2 Answers


Approach #1

One approach for performance would be to use the underlying array data along with NumPy ufuncs, slicing out the two blocks of columns so the ufuncs operate in a vectorized manner, like so -

a = df.values
rmse_out = np.sqrt(((a[:,0:3] - a[:,3:6])**2).mean(1))
df['rmse_out'] = rmse_out

Approach #2

An alternative, faster way to compute the RMSE values is to replace the squared-summation with np.einsum -

diffs = a[:,0:3] - a[:,3:6]
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)

Approach #3

Another way to compute rmse_out, using the formula:

(a - b)^2 = a^2 + b^2 - 2ab

would be to extract the slices:

s0 = a[:,0:3]
s1 = a[:,3:6]

Then, rmse_out would be -

np.sqrt(((s0**2).sum(1) + (s1**2).sum(1) - (2*s0*s1).sum(1))/3.0)

which with einsum becomes -

np.sqrt((np.einsum('ij,ij->i',s0,s0) + \
         np.einsum('ij,ij->i',s1,s1) - \
       2*np.einsum('ij,ij->i',s0,s1))/3.0)

Getting respective column indices

If you are not sure whether the columns a,b,... are in that order, you can find their positions with the column_index helper defined earlier.

Thus a[:,0:3] would be replaced by a[:,column_index(df, ['a','b','c'])] and a[:,3:6] by a[:,column_index(df, ['d','e','y'])].
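
For example, here is a minimal sketch combining column_index with Approach #1 (column names taken from the question's toy frame):

idx0 = column_index(df, ['a', 'b', 'c'])  # positions of the y_true columns
idx1 = column_index(df, ['d', 'e', 'y'])  # positions of the y_pred columns

# note: fancy indexing such as a[:, idx0] copies the selected columns,
# whereas a contiguous slice like a[:, 0:3] is a view
rmse_out = np.sqrt(((a[:, idx0] - a[:, idx1])**2).mean(1))
df['rmse_out'] = rmse_out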

Divakar
  • Thanks for your solutions! I did a wall time comparison of your three solutions and added it to my original post. – Thomas Matthew Sep 22 '17 at 19:23
  • @ThomasMatthew Awesome! Good to see those timings. – Divakar Sep 22 '17 at 19:25
  • Getting a `MemoryError` at larger DF sizes, which fails at the equivalent of `a[:,column_index(df, ['d','e','y'])]`. Is there another way to accomplish approach 2 directly from the dataframe rather than creating a copy of the values with `a=df.values` which may be taking up more memory than needed? – Thomas Matthew Sep 25 '17 at 04:25
  • @ThomasMatthew With all floating point values in the input dataframe, `df.values` would just be a view into the input dataframe and not a copy. Additionally let me ask you - Would cols - `a,b,c` always be column numbers `0,1,2` and `d,e,y` as `3,4,5` with 0-based indexing? – Divakar Sep 25 '17 at 04:45
  • @Divakar Yes, I can know the column numbers ahead of time, and they should always be on 0-based indexing – Thomas Matthew Sep 25 '17 at 05:28
  • @ThomasMatthew Well, I meant: if the column numbers are in sequence like `0,1,2` for the first slice, then, as mentioned at the start of my post, use `a[:,0:3]` instead of `a[:,column_index(df, c0)]`, and similarly for the second slice. `a[:,column_index(df, c0)]` was forcing a copy there, whereas `a[:,0:3]` would be a view. – Divakar Sep 25 '17 at 05:30
  • @Divakar I understand what you're saying now. I can preprocess the dataframe so each group is contiguous, and can therefore use a slicing view. – Thomas Matthew Sep 25 '17 at 05:35
  • @ThomasMatthew You got it. – Divakar Sep 25 '17 at 05:35
  • @Divakar memory error persists at the step `diffs = a[:,1:26] - a[:,26:27]`. Any suggestions on how to chunk this operation? – Thomas Matthew Sep 25 '17 at 22:37
  • @ThomasMatthew That looks like the best we could do I am afraid. – Divakar Sep 25 '17 at 22:50
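
On the chunking question above, one option (a minimal sketch, not from the answers; the block size is an illustrative assumption) is to process the rows in blocks, so that only one block's diffs array is materialized at a time:

# hypothetical chunked variant of Approach #2; block size is illustrative
block = 1_000_000
rmse_out = np.empty(len(a))
for start in range(0, len(a), block):
    stop = start + block
    diffs = a[start:stop, 0:3] - a[start:stop, 3:6]  # only this block in memory
    rmse_out[start:stop] = np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / diffs.shape[1])
df['rmse_out'] = rmse_out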

The df.apply approach: your attempt fails because mean_squared_error(df[['a','b','c']], df[['d','e','y']])**0.5 is evaluated eagerly to a single numpy.float64 before df.apply ever runs, and apply then tries to call that float. Wrapping the computation in a lambda gives apply the callable it expects:

df['rmse'] = df.apply(lambda x: mean_squared_error(x[['a','b','c']], x[['d','e','y']])**0.5, axis=1)

col     a     b     c     d     e     y      rmse
row                                              
a    0.00 -0.80 -0.60 -0.30  0.80  0.01  1.003677
b   -0.80  0.00  0.50  0.70 -0.90  0.01  1.048825
c   -0.60  0.50  0.00  0.30  0.10  0.01  0.568653
d   -0.30  0.70  0.30  0.00  0.20  0.01  0.375988
e    0.80 -0.90  0.10  0.20  0.00  0.01  0.626658
y    0.01  0.01  0.01  0.01  0.01  0.00  0.005774
Alexey Trofimov