sklearn's implementation of NMF does not seem to support missing values (NaNs; here the 0 values basically represent unknown ratings, e.g. those of new users); refer to this issue. However, we can use surprise's NMF implementation, as shown in the following code:
import numpy as np
import pandas as pd
from surprise import NMF, Dataset, Reader
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
R[R==0] = np.nan
print(R)
# [[ 5. 3. nan 1.]
# [ 4. nan nan 1.]
# [ 1. 1. nan 5.]
# [ 1. nan nan 4.]
# [nan 1. 5. 4.]]
df = pd.DataFrame(data=R, index=range(R.shape[0]), columns=range(R.shape[1]))
df = pd.melt(df.reset_index(), id_vars='index', var_name='items', value_name='ratings').dropna(axis=0)
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df[['index', 'items', 'ratings']], reader)
k = 2
algo = NMF(n_factors=k)
trainset = data.build_full_trainset()
algo.fit(trainset)
predictions = algo.test(trainset.build_testset()) # predict the known ratings
R_hat = np.zeros_like(R)
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est
predictions = algo.test(trainset.build_anti_testset()) # predict the unknown ratings
for uid, iid, true_r, est, _ in predictions:
    R_hat[uid, iid] = est
print(R_hat)
# [[4.40762528 2.62138084 3.48176319 0.91649316]
# [3.52973408 2.10913555 2.95701406 0.89922637]
# [0.94977826 0.81254138 4.98449755 4.34497549]
# [0.89442186 0.73041578 4.09958967 3.50951819]
# [1.33811051 0.99007556 4.37795636 3.53113236]]
The NMF implementation follows the [NMF:2014] paper, as described here.
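For reference, here is my recollection of the update rules from the Surprise documentation (a sketch, not copied verbatim from it). The prediction is the dot product of the item and user factors, and each factor is updated multiplicatively, which keeps it non-negative as long as it starts positive. Here $I_u$ is the set of items rated by user $u$, $U_i$ is the set of users who rated item $i$, and $\lambda_u$, $\lambda_i$ are the regularization terms (`reg_pu` and `reg_qi` in Surprise):

```latex
\hat{r}_{ui} = q_i^{T} p_u

p_{uf} \leftarrow p_{uf} \cdot
  \frac{\sum_{i \in I_u} q_{if} \, r_{ui}}
       {\sum_{i \in I_u} q_{if} \, \hat{r}_{ui} + \lambda_u \lvert I_u \rvert \, p_{uf}}

q_{if} \leftarrow q_{if} \cdot
  \frac{\sum_{u \in U_i} p_{uf} \, r_{ui}}
       {\sum_{u \in U_i} p_{uf} \, \hat{r}_{ui} + \lambda_i \lvert U_i \rvert \, q_{if}}
```

All sums run over known ratings only, so missing entries never enter the updates.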

Note that here the optimization is performed using the known ratings only. As a result, the predicted values for the known ratings are close to the true ratings, while the predicted values for the unknown ratings are in general not close to 0 (as expected).
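To make the "known ratings only" point concrete, here is a minimal NumPy sketch of a masked NMF. It is not Surprise's algorithm: it uses plain, unregularized multiplicative updates on the masked squared error, and the function name `masked_nmf` and its parameters are my own, chosen for illustration.

```python
import numpy as np

def masked_nmf(R, k, n_iter=500, eps=1e-9, seed=0):
    """Factor R ~ P @ Q.T using only the non-NaN entries of R (illustrative helper)."""
    rng = np.random.default_rng(seed)
    M = ~np.isnan(R)                  # True where a rating is known
    X = np.where(M, R, 0.0)           # NaNs become 0 here but are masked out below
    P = rng.uniform(0.1, 1.0, (R.shape[0], k))
    Q = rng.uniform(0.1, 1.0, (R.shape[1], k))
    for _ in range(n_iter):
        # Multiplicative updates: unknown cells contribute to neither
        # the numerator nor the denominator.
        P *= (X @ Q) / ((M * (P @ Q.T)) @ Q + eps)
        Q *= (X.T @ P) / ((M * (P @ Q.T)).T @ P + eps)
    return P, Q, M

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
R[R == 0] = np.nan

P, Q, M = masked_nmf(R, k=2)
R_hat = P @ Q.T
rmse_known = np.sqrt(np.mean((R_hat[M] - R[M]) ** 2))
print("RMSE on known ratings:", rmse_known)
print("Predictions for unknown cells:", R_hat[~M])
```

Because the loss only sees the masked cells, nothing pulls the unknown cells toward 0; their predictions are whatever the learned factors imply.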
Again, as usual, we can find the number of factors k using cross-validation.
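As a rough sketch of what that could look like on this toy matrix, the snippet below runs leave-one-out cross-validation over the known ratings for a few values of k. It uses a plain NumPy masked NMF as a stand-in rather than Surprise's NMF, and the helper names (`fit_masked_nmf`, `loo_rmse`) are hypothetical. With Surprise itself, `surprise.model_selection.GridSearchCV` with a `param_grid` over `n_factors` serves the same purpose.

```python
import numpy as np

def fit_masked_nmf(R, M, k, n_iter=300, eps=1e-9, seed=0):
    """Return P @ Q.T fitted on the cells where M is True (illustrative helper)."""
    rng = np.random.default_rng(seed)
    X = np.where(M, R, 0.0)
    P = rng.uniform(0.1, 1.0, (R.shape[0], k))
    Q = rng.uniform(0.1, 1.0, (R.shape[1], k))
    for _ in range(n_iter):
        P *= (X @ Q) / ((M * (P @ Q.T)) @ Q + eps)
        Q *= (X.T @ P) / ((M * (P @ Q.T)).T @ P + eps)
    return P @ Q.T

def loo_rmse(R, k):
    """Leave-one-out RMSE over known ratings for a given number of factors k."""
    known = ~np.isnan(R)
    sq_errors = []
    for u, i in np.argwhere(known):
        M = known.copy()
        M[u, i] = False               # hide one known rating
        # Skip cold-start cases where the user or item has no rating left.
        if not M[u].any() or not M[:, i].any():
            continue
        R_hat = fit_masked_nmf(R, M, k)
        sq_errors.append((R_hat[u, i] - R[u, i]) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
R[R == 0] = np.nan

scores = {k: loo_rmse(R, k) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```

On a matrix this small the scores are noisy, so treat the selected k as illustrative; on real data, K-fold splits with several repetitions would be a more sensible choice.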