
I am performing topic extraction on natural-language data using NMF (aka NNMF) from scikit-learn, and I am trying to optimize the number of clusters (aka components). To do this, I need to calculate the reconstruction error. However, scikit-learn only seems to offer a way to calculate this metric on the training set, while I am interested in getting it for the test set as well. Any suggestions?

user179041

1 Answer


It's easy to emulate sklearn's mechanism on external data.

This error metric is calculated internally using the function _beta_divergence(X, W, H, self.beta_loss, square_root=True).
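For beta_loss='frobenius' with square_root=True, this is just the Frobenius norm of the residual X - WH. Here is a minimal sketch on dense toy data (the toy matrix and variable names are mine, purely illustrative):

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = np.abs(rng.randn(100, 20))  # nonnegative toy data

nmf = NMF(n_components=5, random_state=0).fit(X)
W = nmf.transform(X)   # per-sample coefficients
H = nmf.components_    # per-component loadings

# Frobenius norm of the residual; this is what
# _beta_divergence(X, W, H, 'frobenius', square_root=True) computes
manual_err = np.linalg.norm(X - W @ H, 'fro')
print(manual_err, nmf.reconstruction_err_)  # should be close (see remark below)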

How to obtain W and H is outlined in the API docs: W = nmf.transform(X) and H = nmf.components_.

Assuming sklearn >= 0.19 (where this was introduced), we can simply copy that usage.

Here is a full demo:

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import NMF
from sklearn.decomposition.nmf import _beta_divergence  # needs sklearn 0.19!!!
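# note: in newer scikit-learn releases (>= 0.22) this private helper moved;
# there it would be: from sklearn.decomposition._nmf import _beta_divergence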

""" Test-data """
bunch_train = fetch_20newsgroups_vectorized(subset='train')
bunch_test = fetch_20newsgroups_vectorized(subset='test')
X_train = bunch_train.data
X_test = bunch_test.data
X_train = X_train[:2500, :]  # smaller for demo
X_test = X_test[:2500, :]    # ...

""" NMF fitting """
nmf = NMF(n_components=10, random_state=0, alpha=.1, l1_ratio=.5).fit(X_train)
print('original reconstruction error automatically calculated -> TRAIN: ', nmf.reconstruction_err_)

""" Manual reconstruction_err_ calculation
    -> use transform to get W
    -> ask fitted NMF to get H
    -> use available _beta_divergence-function to calculate desired metric
"""
W_train = nmf.transform(X_train)
rec_error = _beta_divergence(X_train, W_train, nmf.components_, 'frobenius', square_root=True)
print('Manually calculated rec-error train: ', rec_error)

W_test = nmf.transform(X_test)
rec_error = _beta_divergence(X_test, W_test, nmf.components_, 'frobenius', square_root=True)
print('Manually calculated rec-error test: ', rec_error)

Output:

original reconstruction error automatically calculated -> TRAIN:  37.326794668961604
Manually calculated rec-error train:  37.326816210011778
Manually calculated rec-error test:  37.019526486067413

Remark: there is a tiny discrepancy between the two train values, probably induced by floating-point math; I haven't checked exactly where it comes from. Smaller problems behave better, and the problem above is huge, at least in terms of n_features.

Keep in mind that this particular calculation and function were chosen by the developers, presumably with a sound underlying theory. But in general I would say: as matrix factorization is all about reconstruction, you can build any metric you like based on the idea of comparing X_orig with nmf.inverse_transform(nmf.transform(X_orig)).
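For instance, a hand-rolled metric along those lines might look like this (a sketch with illustrative names and dense toy data; for sparse input like the demo above, densify first, e.g. via np.asarray(X.todense())):

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X_train = np.abs(rng.randn(200, 50))  # nonnegative toy data
X_test = np.abs(rng.randn(100, 50))

def reconstruction_mse(model, X):
    # compare X with its round-trip reconstruction
    X_hat = model.inverse_transform(model.transform(X))
    return np.mean((X - X_hat) ** 2)

# sweep n_components and keep the value with the lowest test error,
# which is exactly the model-selection question asked above
for k in (2, 5, 10):
    model = NMF(n_components=k, random_state=0).fit(X_train)
    print(k, reconstruction_mse(model, X_test))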

sascha
  • Thank you for the workaround/approximation, which worked for me at the time since I was doing a rapid prototype. I am going to suggest this functionality be added to the scikit-learn API, and hope to join as a committer. If you are interested in joining me, let me know, and I will send the link – user179041 Oct 14 '17 at 16:47