
As part of a project I'm working on, I need to calculate the mean squared error between 2m vectors.

Basically, I have two matrices x and xhat, both of size m by n, and the vectors I'm interested in are the rows of these matrices.

I calculate the MSE with this code:

import numpy as np

def cost(x, xhat):  # mean squared error between x (the data) and xhat (the machine's output)
    m = x.shape[0]  # number of rows; m is a global in my full code
    return (1.0 / (2 * m)) * np.trace(np.dot(x - xhat, (x - xhat).T))

It works correctly; the formula itself is right.

The problem is that in my specific case, m and n are very large. Specifically, m = 60000 and n = 785, so when my code enters this function, I get a memory error.

Is there a better way to calculate the MSE? I'd rather avoid for loops and I lean heavily towards matrix operations, but full matrix multiplication seems extremely wasteful here. Maybe there's something in numpy I'm not aware of?

martineau
Oria Gruber

2 Answers


The expression np.dot(x-xhat,(x-xhat).T) creates an array with shape (m, m). You say m is 60000, so with 8-byte floats that array takes 60000 × 60000 × 8 bytes, almost 29 gigabytes.

You take the trace of the array, which is just the sum of the diagonal elements, so most of that huge array is unused. If you look carefully at np.trace(np.dot(x-xhat,(x-xhat).T)), you'll see that it is just the sum of the squares of all the elements of x - xhat. So a simpler way to compute np.trace(np.dot(x-xhat,(x-xhat).T)) that doesn't require the huge intermediate array is ((x - xhat)**2).sum(). For example,

In [44]: x
Out[44]: 
array([[ 0.87167186,  0.96838389,  0.72545457],
       [ 0.05803253,  0.57355625,  0.12732163],
       [ 0.00874702,  0.01555692,  0.76742386],
       [ 0.4130838 ,  0.89307633,  0.49532327],
       [ 0.15929044,  0.27025289,  0.75999848]])

In [45]: xhat
Out[45]: 
array([[ 0.20825392,  0.63991699,  0.28896932],
       [ 0.67658621,  0.64919721,  0.31624655],
       [ 0.39460861,  0.33057769,  0.24542263],
       [ 0.10694332,  0.28030777,  0.53177585],
       [ 0.21066692,  0.53096774,  0.65551612]])

In [46]: np.trace(np.dot(x-xhat,(x-xhat).T))
Out[46]: 2.2352330441581061

In [47]: ((x - xhat)**2).sum()
Out[47]: 2.2352330441581061
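
Putting this together, a memory-safe version of the cost function (a sketch; here m is derived from x.shape rather than taken as a global, matching the formula in the question) could look like:

```python
import numpy as np

def cost(x, xhat):
    """Mean squared error between data x and output xhat,
    computed without forming the (m, m) intermediate array."""
    m = x.shape[0]
    diff = x - xhat
    # trace(diff @ diff.T) equals the sum of squares of all entries of diff
    return (1.0 / (2 * m)) * (diff ** 2).sum()
```

The only memory this needs beyond the inputs is the (m, n) array diff, which for m = 60000 and n = 785 is about 377 MB instead of 29 GB.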

For more ideas about computing the MSE, see the link provided by user1984065 in a comment.

Warren Weckesser

If you are going for performance as well, one alternative approach to compute sum of squared differences could be using np.einsum, like so -

subs = x-xhat
out = np.einsum('ij,ij',subs,subs)
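
For reference, a quick sanity check (my own sketch, with made-up random inputs) that the einsum form matches the direct sum of squared differences:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 50))
xhat = rng.random((100, 50))

subs = x - xhat
# einsum multiplies the two arguments elementwise and sums over both axes in one pass
out = np.einsum('ij,ij', subs, subs)
print(np.allclose(out, (subs ** 2).sum()))  # True
```

einsum avoids materializing the squared array, which can make it a bit faster than (subs**2).sum() on large inputs.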
Divakar