
The page https://pypi.python.org/pypi/fancyimpute contains the snippet

# Instead of solving the nuclear norm objective directly, instead
# induce sparsity using singular value thresholding
X_filled_softimpute = SoftImpute().complete(X_incomplete_normalized)

which suggests that I need to normalize the input data. However, I could not find any details online about what exactly is meant by that. Do I have to normalize my data beforehand, and what exactly is expected?

Make42

1 Answer


Yes, you should definitely normalize the data. Consider the following example:

from fancyimpute import SoftImpute
import numpy as np

# 5x3 matrix drawn from a normal distribution with mean 100 and std 0.5
v = np.random.normal(100, 0.5, (5, 3))
v[2, 1:3] = np.nan
v[0, 0] = np.nan
v[3, 0] = np.nan
SoftImpute().complete(v)

The result is

array([[  81.78428587,   99.69638878,  100.67626769],
       [  99.82026281,  100.09077899,   99.50273223],
       [  99.70946085,   70.98619873,   69.57668189],
       [  81.82898539,   99.66269922,  100.95263318],
       [  99.14285815,  100.10809651,   99.73870089]])

Note that the imputed values in the places where I put NaN are completely off. However, if you instead run

from fancyimpute import SoftImpute
import numpy as np

# Same setup, but the matrix is drawn from a standard normal
# distribution (mean 0, std 1)
v = np.random.normal(0, 1, (5, 3))
v[2, 1:3] = np.nan
v[0, 0] = np.nan
v[3, 0] = np.nan
SoftImpute().complete(v)

(the same code as before; the only difference is that v is drawn from a standard normal distribution, so every column already has mean 0 and std 1 in expectation), you get the following reasonable result:

array([[ 0.07705556, -0.53449412, -0.20081351],
       [ 0.9709198 , -1.19890962, -0.25176222],
       [ 0.41839224, -0.11786451,  0.03231515],
       [ 0.21374759, -0.66986997,  0.78565414],
       [ 0.30004524,  1.28055845,  0.58625942]])

Thus, when you are using SoftImpute, don't forget to normalize your data (you can do that by shifting every column to mean 0 and scaling it to std 1).
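
For concreteness, here is a minimal sketch of that preprocessing, assuming the same .complete() call as above. The column statistics are computed with np.nanmean/np.nanstd so the missing entries don't bias them, and the imputed matrix is mapped back to the original scale at the end:

from fancyimpute import SoftImpute
import numpy as np

v = np.random.normal(100, 0.5, (5, 3))
v[2, 1:3] = np.nan
v[0, 0] = np.nan
v[3, 0] = np.nan

# Column means and stds, ignoring the missing entries
mu = np.nanmean(v, axis=0)
sigma = np.nanstd(v, axis=0)

# Normalize, impute, then undo the normalization
v_normalized = (v - mu) / sigma
v_filled = SoftImpute().complete(v_normalized) * sigma + mu

The back-transformation at the end lets you keep working with the data on its original scale.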

Miriam Farber