
I am trying to build a recommendation system using non-negative matrix factorization. Using scikit-learn's NMF as the model, I fit my data, which results in a certain loss (i.e., reconstruction error). Then I generate recommendations for new data using the inverse_transform method.

Now I do the same using another model that I built in TensorFlow. The reconstruction error after training is close to the one obtained with sklearn's approach earlier. However, neither the latent factors nor the final recommendations are similar between the two models.

One difference between the two approaches that I am aware of: in sklearn I am using the Coordinate Descent solver, whereas in TensorFlow I am using the AdamOptimizer, which is based on gradient descent. Everything else seems to be the same:

  1. The loss function used is the Frobenius norm
  2. No regularization in either case
  3. Tested on the same data, using the same number of latent dimensions

Relevant code that I am using:

1. scikit-learn approach:

from sklearn.decomposition import NMF

model = NMF(alpha=0.0, init='random', l1_ratio=0.0, max_iter=200,
            n_components=2, random_state=0, shuffle=False, solver='cd',
            tol=0.0001, verbose=0)
model.fit(data)
result = model.inverse_transform(model.transform(data))
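
For the comparison with the TensorFlow factors below, note that inverse_transform simply multiplies the transformed data by the learned components, so the sklearn reconstruction plays the same role as the w/h product in the TensorFlow model (a small sketch):

import numpy as np

W_sk = model.transform(data)   # latent factors for the rows of data
H_sk = model.components_       # latent factors for the columns of data
# inverse_transform just computes the matrix product W_sk @ H_sk
assert np.allclose(model.inverse_transform(W_sk), W_sk @ H_sk)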

2. TensorFlow approach:

import tensorflow as tf

# x holds the input matrix (the same data as above), fed in at training time
x = tf.placeholder(tf.float32, shape=data.shape)
w = tf.get_variable('w', initializer=tf.abs(tf.random_normal((data.shape[0], 2))),
                    constraint=lambda p: tf.maximum(0., p))
h = tf.get_variable('h', initializer=tf.abs(tf.random_normal((2, data.shape[1]))),
                    constraint=lambda p: tf.maximum(0., p))
loss = tf.sqrt(tf.reduce_sum(tf.squared_difference(x, tf.matmul(w, h))))
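
A minimal sketch of the rest of the training setup, for context (TF 1.x API; the learning rate and step count below are illustrative assumptions, not values from the original code):

train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(5000):
        _, cur_loss = sess.run([train_op, loss], feed_dict={x: data})
    # reconstruction, analogous to inverse_transform(transform(data)) above
    result_tf = sess.run(tf.matmul(w, h))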

My question is: if the recommendations generated by these two approaches do not match, how can I determine which ones are right? Based on my use case, sklearn's NMF is giving me good results, but the TensorFlow implementation is not. How can I achieve the same with my custom implementation?

swathis
  • One could write multiple pages about all those components, but let's just say: it's non-convex optimization, and convergence (if it happens) depends on the initial values (different local minima are possible). Without seeing code it's hard to grasp what you are doing exactly. (Furthermore: without regularization, you probably won't achieve good results in the recommender setting; also, most recommenders do not use NMF, so what's your reason for using it?) – sascha Mar 17 '18 at 15:16
  • @sascha - Modified the post to include code. I do realize there are multiple local minima and that the two are most likely not converging to the same point. However, I would like to understand how I can then achieve good results using a custom implementation. I understand that regularization is necessary for better results, but this is just a basic example and I want to get comparable results with both approaches first. Do you mean collaborative filtering or content-based approaches? – swathis Mar 17 '18 at 15:38
  • Then research all the components, use the same initial point, and tune the optimizers to be more conservative/more local (not Adam; simple vanilla SGD; small step length, many iterations). But I don't see any gain in doing that. What I meant in terms of alternatives was *low-rank matrix factorization* with better rank proxies (trace norm or max norm). So in short: a different loss, harder to optimize, but doable, even at large scale (under some assumptions). – sascha Mar 17 '18 at 15:43
  • Already using the same initial values; also tried SGD and a range of hyper-parameters. Isn't NMF also a form of low-rank matrix factorization, since the number of latent dimensions is very small compared to the original dimensions? – swathis Mar 18 '18 at 07:56
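
For reference, one way to make the "same initial point" idea from the comments explicit in both libraries is to generate the starting factors once and hand them to each model. A rough sketch (W0 and H0 are just names chosen here for the shared non-negative starting factors):

import numpy as np

rng = np.random.RandomState(0)
W0 = np.abs(rng.randn(data.shape[0], 2))
H0 = np.abs(rng.randn(2, data.shape[1]))

# scikit-learn: init='custom' accepts explicit starting factors
model = NMF(n_components=2, init='custom', solver='cd', max_iter=200, tol=0.0001)
W_sk = model.fit_transform(data, W=W0.copy(), H=H0.copy())

# TensorFlow: initialize the variables from the same arrays
w = tf.get_variable('w_shared', initializer=tf.constant(W0, dtype=tf.float32),
                    constraint=lambda p: tf.maximum(0., p))
h = tf.get_variable('h_shared', initializer=tf.constant(H0, dtype=tf.float32),
                    constraint=lambda p: tf.maximum(0., p))

Even with a shared starting point, coordinate descent and Adam can still end up in different local minima, as noted in the comments above.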

1 Answer


The choice of the optimizer has a big impact on the quality of the training. Some very simple models (I'm thinking of GloVe, for example) work with some optimizers and not at all with others. To answer your questions:

  1. How can I determine which are the right ones?

The evaluation is as important as the design of your model, and it is just as hard. For example, you can run these two models on several available datasets and use some metrics to score them. You could also use A/B testing on a real application to estimate the relevance of your recommendations.
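
As a minimal sketch of the metric route (result and result_tf stand for the two reconstructions from the snippets in the question; the 10% hold-out fraction is an arbitrary choice, and in a real evaluation those entries would also have to be hidden from the models during training):

import numpy as np

def rmse(original, reconstruction, mask):
    # root-mean-squared error restricted to the entries selected by mask
    diff = (original - reconstruction)[mask]
    return np.sqrt(np.mean(diff ** 2))

rng = np.random.RandomState(0)
test_mask = rng.rand(*data.shape) < 0.1   # ~10% of entries as a test set

print("sklearn    RMSE:", rmse(data, result, test_mask))
print("tensorflow RMSE:", rmse(data, result_tf, test_mask))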

  2. How can I achieve the same using my custom implementation?

First, try to find a coordinate descent optimizer for TensorFlow and make sure every step you implement is exactly the same as in scikit-learn. Then, if you still can't reproduce the results, try different solutions (why not start with a simple gradient descent optimizer first?) and take advantage of the great modularity that TensorFlow offers!
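
If reproducing scikit-learn's coordinate descent steps one-for-one is too involved, a simpler deterministic reference point is the classical multiplicative-update rule for the Frobenius loss (the same update scheme scikit-learn exposes as solver='mu'). A rough NumPy sketch:

import numpy as np

def nmf_mu(X, k, n_iter=200, eps=1e-10, seed=0):
    # Lee & Seung multiplicative updates for min ||X - W H||_F with W, H >= 0
    rng = np.random.RandomState(seed)
    W = np.abs(rng.randn(X.shape[0], k))
    H = np.abs(rng.randn(k, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

W_mu, H_mu = nmf_mu(data, 2)
reconstruction = W_mu @ H_mu

Since every step is explicit, it is easy to port the same updates to TensorFlow and check that both implementations move identically from a shared starting point.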

Finally, if the recommendations provided by your implementation are that bad, it suggests there is an error in it. Try to compare it with some existing implementations.

Robin
  • Already tried with SGD and there is not much of a difference in the final results compared to using Adam (not considering the time to convergence). I also followed the exact same code linked by you, and the above results are based on it. What do you mean by "take advantage of the great modularity that TensorFlow offers"? Could you please elaborate or be more specific? – swathis Mar 22 '18 at 11:02
  • By modularity I mean that you can easily change parts of your code (optimizer, constraints, loss, regularization etc.) – Robin Mar 26 '18 at 11:51