Matrix completion in Python

Question

Say I have a matrix:

> import numpy as np
> a = np.random.random((5,5))

array([[ 0.28164485,  0.76200749,  0.59324211,  0.15201506,  0.74084168],
       [ 0.83572213,  0.63735993,  0.28039542,  0.19191284,  0.48419414],
       [ 0.99967476,  0.8029097 ,  0.53140614,  0.24026153,  0.94805153],
       [ 0.92478   ,  0.43488547,  0.76320656,  0.39969956,  0.46490674],
       [ 0.83315135,  0.94781119,  0.80455425,  0.46291229,  0.70498372]])

And that I punch some holes in it with np.nan, e.g.:

> a[(1,4,0,3),(2,4,2,0)] = np.nan

array([[ 0.80327707,  0.87722234,         nan,  0.94463778,  0.78089194],
       [ 0.90584284,  0.18348667,         nan,  0.82401826,  0.42947815],
       [ 0.05913957,  0.15512961,  0.08328608,  0.97636309,  0.84573433],
       [        nan,  0.30120861,  0.46829231,  0.52358888,  0.89510461],
       [ 0.19877877,  0.99423591,  0.17236892,  0.88059185,        nan ]])

I would like to fill in the nan entries using information from the remaining entries of the matrix. One example would be to use the average value of the column in which each nan occurs.

More generally, are there any libraries in Python for matrix completion (e.g. something along the lines of Candes & Recht's convex optimization method)?

Background:

This problem appears often in machine learning, for example when working with missing features in classification/regression or in collaborative filtering (e.g. see the Netflix Problem on Wikipedia and here).

score 12 · Accepted Answer · answered Aug 01 '13 at 18:34

If you install the latest scikit-learn, version 0.14a1, you can use its shiny new Imputer class:

>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(strategy="mean")
>>> a = np.random.random((5,5))
>>> a[(1,4,0,3),(2,4,2,0)] = np.nan
>>> a
array([[ 0.77473361,  0.62987193,         nan,  0.11367791,  0.17633671],
       [ 0.68555944,  0.54680378,         nan,  0.64186838,  0.15563309],
       [ 0.37784422,  0.59678177,  0.08103329,  0.60760487,  0.65288022],
       [        nan,  0.54097945,  0.30680838,  0.82303869,  0.22784574],
       [ 0.21223024,  0.06426663,  0.34254093,  0.22115931,         nan]])
>>> a = imp.fit_transform(a)
>>> a
array([[ 0.77473361,  0.62987193,  0.24346087,  0.11367791,  0.17633671],
       [ 0.68555944,  0.54680378,  0.24346087,  0.64186838,  0.15563309],
       [ 0.37784422,  0.59678177,  0.08103329,  0.60760487,  0.65288022],
       [ 0.51259188,  0.54097945,  0.30680838,  0.82303869,  0.22784574],
       [ 0.21223024,  0.06426663,  0.34254093,  0.22115931,  0.30317394]])

After this, you can use imp.transform to do the same transformation to other data, using the mean that imp learned from a. Imputers tie into scikit-learn Pipeline objects so you can use them in classification or regression pipelines.
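To make the fit/transform split above concrete, here is a minimal NumPy-only sketch of the same idea: learn the column means from one matrix, then reuse them on another. The helper names `fit_column_means` and `transform_with_means` are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

def fit_column_means(X):
    """Learn one mean per column, ignoring NaN entries (hypothetical helper)."""
    # Note: a column that is entirely NaN yields NaN here (with a RuntimeWarning).
    return np.nanmean(np.asarray(X, dtype=float), axis=0)

def transform_with_means(X, means):
    """Return a copy of X with each NaN replaced by the learned mean of its column."""
    X = np.array(X, dtype=float, copy=True)
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = means[cols]
    return X

# Learn the means from one matrix, then apply them to different data.
train = np.array([[1.0, np.nan],
                  [3.0, 4.0]])
means = fit_column_means(train)
other = np.array([[np.nan, 5.0],
                  [np.nan, np.nan]])
filled_other = transform_with_means(other, means)
```

Keeping the learned means separate from the data they are applied to is what lets the same imputation be reused on a test set.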

If you want to wait for a stable release, then 0.14 should be out next week.

Full disclosure: I'm a scikit-learn core developer

But it fails on rows with all unknown values. Furthermore, is there a more advanced matrix completion method? Imputer infers values based only on the median, mean or most frequent value. – lenhhoxung, Sep 06 '18 at 09:37

Daniel · Answer 2 · 2013-08-01T00:28:31.740

You can do it with pure numpy, but it's nastier.

>>> from scipy.stats import nanmean
>>> a
array([[ 0.70309466,  0.53785006,         nan,  0.49590115,  0.23521493],
       [ 0.29067786,  0.48236186,         nan,  0.93220001,  0.76261019],
       [ 0.66243065,  0.07731947,  0.38887545,  0.56450533,  0.58647126],
       [        nan,  0.7870873 ,  0.60010096,  0.88778259,  0.09097726],
       [ 0.02750389,  0.72328898,  0.69820328,  0.02435883,         nan]])


>>> mean=nanmean(a,axis=0)
>>> mean
array([ 0.42092677,  0.52158153,  0.56239323,  0.58094958,  0.41881841])
>>> index=np.where(np.isnan(a))

>>> a[index]=np.take(mean,index[1])
>>> a
array([[ 0.70309466,  0.53785006,  0.56239323,  0.49590115,  0.23521493],
       [ 0.29067786,  0.48236186,  0.56239323,  0.93220001,  0.76261019],
       [ 0.66243065,  0.07731947,  0.38887545,  0.56450533,  0.58647126],
       [ 0.42092677,  0.7870873 ,  0.60010096,  0.88778259,  0.09097726],
       [ 0.02750389,  0.72328898,  0.69820328,  0.02435883,  0.41881841]])
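Note that `scipy.stats.nanmean` is no longer available in recent SciPy releases. A rough equivalent of the steps above, assuming NumPy 1.8 or later (where `np.nanmean` exists), could look like this; the function name `fill_with_column_means` is only illustrative:

```python
import numpy as np

def fill_with_column_means(a):
    """Return a copy of a where every NaN is replaced by the mean of its column."""
    a = np.asarray(a, dtype=float)
    col_means = np.nanmean(a, axis=0)  # one mean per column, NaNs ignored
    # col_means has shape (n_cols,), so it broadcasts across the rows of a.
    return np.where(np.isnan(a), col_means, a)

a = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
filled = fill_with_column_means(a)
```

Unlike the in-place assignment above, this version leaves the input array untouched and returns a new one.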

Running some timings:

import time
import numpy as np
import pandas as pd
from scipy.stats import nanmean

a = np.random.random((10000,10000))
col=np.random.randint(0,10000,500)
row=np.random.randint(0,10000,500)
a[(col,row)]=np.nan
a1=np.copy(a)


%timeit mean=nanmean(a,axis=0);index=np.where(np.isnan(a));a[index]=np.take(mean,index[1])
1 loops, best of 3: 1.84 s per loop

%timeit DF=pd.DataFrame(a1);col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.81 s per loop

# Surprisingly, the slow part seems to be apply looping over axis 0, not the DataFrame construction.
DF=pd.DataFrame(a1)
%timeit col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 5.57 s per loop

I do not believe numpy has array completion routines built in; however, pandas does. View the help topic here.

Adam Erickson · Answer 3 · 2018-08-01T01:21:11.267

The exact method you desire (Candes and Recht, 2008) is available for Python in the fancyimpute library, located here (link).

from fancyimpute import NuclearNormMinimization

# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN

X_filled_nnm = NuclearNormMinimization().complete(X_incomplete)

I've seen good results from it. Thankfully, they changed the autodiff and SGD backend from downhill, which uses Theano under the hood, to keras over the past year. The algorithm is available in this library too (link). SciKit-Learn's Imputer() does not include this algorithm. It's not in the documentation, but you can install fancyimpute with pip:

pip install fancyimpute
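For readers who want to see the idea without installing anything, below is a small NumPy sketch of a related technique: iterative SVD with soft-thresholded singular values (SoftImpute-style). This is not Candes and Recht's exact convex program and not fancyimpute's implementation; the function name `soft_impute` and its parameters are assumptions made up for this illustration.

```python
import numpy as np

def soft_impute(X, shrinkage=1.0, max_iters=500, tol=1e-6):
    """Fill NaNs in X by repeatedly soft-thresholding singular values (sketch)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    filled = np.where(observed, X, 0.0)  # start with zeros in the holes
    for _ in range(max_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_shrunk = np.maximum(s - shrinkage, 0.0)  # shrink singular values toward zero
        low_rank = (U * s_shrunk) @ Vt
        # Keep observed entries fixed; take the low-rank estimate only in the holes.
        new_filled = np.where(observed, X, low_rank)
        change = np.linalg.norm(new_filled - filled)
        filled = new_filled
        if change <= tol * max(np.linalg.norm(filled), 1.0):
            break
    return filled

# A rank-1 matrix with two entries hidden.
X = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0])
X_incomplete = X.copy()
X_incomplete[0, 1] = np.nan
X_incomplete[2, 3] = np.nan
X_filled = soft_impute(X_incomplete, shrinkage=0.5)
```

The `shrinkage` value controls how strongly low rank is favoured: a value larger than every singular value simply leaves zeros in the holes, while small values tend to need many more iterations.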

score 4 · Answer 4 · answered Jul 31 '13 at 23:55

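You can do this quite simply with pandas:

import numpy as np
import pandas as pd

DF = pd.DataFrame(a)
col_means = DF.apply(np.mean, 0)  # one mean per column
DF.fillna(value=col_means)        # returns a new DataFrame with each NaN replaced by its column's mean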

Justin

Thanks. By the way, the documentation talks about `bfill`, `backfill`, `pad` and `ffill`. Where can I read more about these methods? (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.fillna.html) – Amelio Vazquez-Reina Aug 01 '13 at 00:08
`bfill` is shorthand for `backfill` and `ffill` is "shorthand" for `pad`. I don't think there is much in the way of documentation but the code is [here](https://github.com/pydata/pandas/blob/fcaf9a666114cac67093823e6752b91391f84e1a/pandas/core/common.py) – Justin Aug 01 '13 at 00:17
In addition you should read up on pandas missing data help, [here](http://pandas.pydata.org/pandas-docs/dev/missing_data.html). – Daniel Aug 01 '13 at 00:27

score 2 · Answer 5 · edited May 23 '17 at 11:54

Similar questions have been asked here before. What you need is a special case of inpainting. Unfortunately, neither numpy nor scipy has built-in routines for this. OpenCV has a function inpaint(), but it only works on 8-bit images.

OpenPIV has a replace_nans function that you can use for your purposes. (See here for a Cython version that you can repackage if you don't want to install the whole library.) It is more flexible than a pure mean or propagation of older values as suggested in other answers (e.g., you can define different weighting functions, kernel sizes, etc.).
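To give a feel for what neighbourhood-based filling does, here is a simplified sketch that replaces each NaN with the plain mean of the valid values in a square window around it. It is not OpenPIV's replace_nans (which supports other weightings); the name `fill_nans_local_mean` and its parameters are hypothetical.

```python
import numpy as np

def fill_nans_local_mean(arr, kernel_size=1, max_iter=10):
    """Replace NaNs with the mean of valid neighbours in a (2k+1) x (2k+1) window."""
    filled = np.array(arr, dtype=float, copy=True)
    for _ in range(max_iter):
        nan_rows, nan_cols = np.nonzero(np.isnan(filled))
        if nan_rows.size == 0:
            break  # nothing left to fill
        updates = filled.copy()
        for r, c in zip(nan_rows, nan_cols):
            window = filled[max(r - kernel_size, 0): r + kernel_size + 1,
                            max(c - kernel_size, 0): c + kernel_size + 1]
            valid = window[~np.isnan(window)]
            if valid.size:  # leave the hole for a later pass if no neighbour is known yet
                updates[r, c] = valid.mean()
        filled = updates
    return filled

grid = np.array([[1.0, 2.0, 3.0],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0, 9.0]])
patched = fill_nans_local_mean(grid)
```

Repeating the pass lets values propagate into larger holes, which is why an iteration limit is part of the interface.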

Using the examples from @Ophion, I compared the replace_nans with the nanmean and Pandas solutions:

import numpy as np
import pandas as pd
from scipy.stats import nanmean

a = np.random.random((10000,10000))
col=np.random.randint(0,10000,500)
row=np.random.randint(0,10000,500)
a[(col,row)]=np.nan
a1=np.copy(a)

%timeit new_array = replace_nans(a1, 10, 0.5, 1.)
1 loops, best of 3: 1.57 s per loop

%timeit mean=nanmean(a,axis=0);index=np.where(np.isnan(a));a[index]=np.take(mean,index[1])
1 loops, best of 3: 2.23 s per loop

%timeit DF=pd.DataFrame(a1);col_means = DF.apply(np.mean, 0);DF.fillna(value=col_means)
1 loops, best of 3: 7.23 s per loop

The replace_nans solution is arguably better and faster.

answered Aug 01 '13 at 12:56

tiago

Unless I am missing something, `replace_nans` fills `nans` with a weighted average and will not be equivalent to replacing the `nans` with the average of the column. With 4 if statements inside 4 loops, I'm not sure how much faster it will be if your array contains many nans. I would be very curious about the timings if you changed the number of nans from 500 to 5000. – Daniel Aug 01 '13 at 13:39
@Ophion: you are right, it is not replacing the nans with the average of the column. But that was the point: the column average is not the best replacement. Out of curiosity I just re-ran the timings using `np.random.randint(0,10000,5000)` for `col` and `row`. `replace_nans` now took 1.55 s, and `nanmean` took 2.15 s. So, pretty similar... – tiago Aug 01 '13 at 14:00
Are you sure it has replaced all `nans` in 10 iterations? I do apologize for being skeptical; the code just doesn't appear to be an efficient way of doing this at first glance. – Daniel Aug 01 '13 at 14:12
It does seem to have replaced all NaNs in 10 iterations. Just try for yourself. The point here is not finding the code that is faster at replacing the largest number of NaNs, but finding the best estimate for the missing values. Inpainting is not designed for images with a large proportion of NaNs. – tiago Aug 01 '13 at 15:00
