
I am implementing Probabilistic Matrix Factorization models in theano and would like to use the Adam gradient descent update rules.

My goal is to have code that is as uncluttered as possible, which means that I do not want to explicitly keep track of the 'm' and 'v' quantities from Adam's algorithm.

It would seem that this is possible, especially after seeing how Lasagne's Adam is implemented: it hides the 'm' and 'v' quantities inside the updates dictionary passed to theano.function.
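As far as I can tell, the mechanism looks like the following toy sketch (the parameter W, the quadratic loss and the learning rate are placeholders of mine, just to show the plumbing):

    import numpy as np
    import theano
    import theano.tensor as T
    import lasagne

    # Toy setup: one shared parameter vector and a quadratic loss
    # (W, target and the learning rate are placeholders, not my actual model).
    W = theano.shared(np.zeros(5, dtype=theano.config.floatX), name='W')
    target = T.vector('target')
    loss = T.sum((W - target) ** 2)

    # lasagne.updates.adam() creates the 'm' and 'v' shared variables internally
    # and returns an OrderedDict of updates for W and for its own state.
    updates = lasagne.updates.adam(loss, [W], learning_rate=0.01)

    # The Adam state lives inside `updates`, so the compiled training function
    # owns it and my code never has to touch 'm' or 'v' explicitly.
    train_step = theano.function([target], loss, updates=updates)
    train_step(np.ones(5, dtype=theano.config.floatX))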

This works when each term of the negative log-likelihood involves a different set of parameters. But in Probabilistic Matrix Factorization each term contains the dot product of one latent user vector and one latent item vector, so if I create an instance of Lasagne's Adam for each term, I end up with multiple 'm' and 'v' quantities for the same latent vector, which is not how Adam is supposed to work.
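To make the structure concrete, here is roughly what a single PMF term looks like in theano (the matrix names, sizes and the regularisation weight are placeholders of mine, not actual model code):

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX
    k, num_users, num_items = 10, 100, 200                       # placeholder sizes
    U = theano.shared(0.1 * np.random.randn(num_users, k).astype(floatX), name='U')
    V = theano.shared(0.1 * np.random.randn(num_items, k).astype(floatX), name='V')

    i, j, r_ij = T.iscalar('i'), T.iscalar('j'), T.scalar('r_ij')

    # One NLL term: squared error of the dot product of one user vector and one
    # item vector, plus L2 penalties (0.01 is a placeholder weight).
    nll_ij = ((r_ij - T.dot(U[i], V[j])) ** 2
              + 0.01 * T.sum(U[i] ** 2) + 0.01 * T.sum(V[j] ** 2))

    # This compiles fine, but instantiating lasagne.updates.adam(nll_ij, [U, V])
    # separately for every (i, j) would create a fresh pair of 'm'/'v' shared
    # variables for U and V on each call.
    f = theano.function([i, j, r_ij], nll_ij)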

I also posted on Lasagne's group, actually twice, where there is some more detail and some examples.

I thought about two possible implementations:

  1. Each existing rating (i.e. each term in the global NLL objective function) gets its own Adam update via a dedicated call to theano.function. Unfortunately this leads to an incorrect use of Adam, since the same latent vector ends up associated with several different 'm' and 'v' quantities, which is not how Adam is supposed to work.
  2. Calling Adam on the entire NLL objective, which makes the update mechanism behave like full-batch Gradient Descent instead of SGD, with all its known disadvantages (high computation time, getting stuck in local minima, etc.); see the sketch after this list.
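For option 2, what I have in mind is a single adam() call on the loss summed over all observed ratings, roughly like this sketch (the data layout, sizes and hyperparameters are placeholders of mine):

    import numpy as np
    import theano
    import theano.tensor as T
    import lasagne

    floatX = theano.config.floatX
    k, num_users, num_items = 10, 100, 200                       # placeholder sizes
    U = theano.shared(0.1 * np.random.randn(num_users, k).astype(floatX), name='U')
    V = theano.shared(0.1 * np.random.randn(num_items, k).astype(floatX), name='V')

    # Observed ratings as parallel index/value arrays (placeholder data layout).
    user_idx = T.ivector('user_idx')
    item_idx = T.ivector('item_idx')
    ratings = T.vector('ratings')

    # Full-batch NLL: sum over every observed rating at once.
    pred = T.sum(U[user_idx] * V[item_idx], axis=1)              # row-wise dot products
    nll = (T.sum((ratings - pred) ** 2)
           + 0.01 * T.sum(U ** 2) + 0.01 * T.sum(V ** 2))

    # One adam() call -> exactly one 'm' and one 'v' per shared matrix,
    # but every call to train() has to be fed the whole data set (option 2's drawback).
    updates = lasagne.updates.adam(nll, [U, V], learning_rate=0.001)
    train = theano.function([user_idx, item_idx, ratings], nll, updates=updates)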

My questions are:

  • Is there perhaps something I did not understand correctly about how Lasagne's Adam works?

  • Would option 2 actually behave like SGD, in the sense that every update to a latent vector influences the other updates (within the same Adam call) that make use of that updated vector?

  • Do you have any other suggestions on how to implement it?

Any ideas on how to solve this issue while avoiding manually keeping replicated vectors and matrices for the 'm' and 'v' quantities?

fstab
  • Seem like this Q would be more at home at http://stats.stackexchange.com/ – redcalx Sep 16 '16 at 16:24
  • @redcalx: I disagree, the problem here lies only in how to *implement* Adam. – fstab Sep 19 '16 at 12:02
  • I didn't read the PMF in detail and I'm not a Lasagne user, but is it possible to express your optimization goal as a single theano function of your model parameters? If so, then this can definitely be done in pure theano. – Kh40tiK Jan 13 '17 at 11:41

1 Answer


It looks like in the paper they are suggesting that you optimise the entire function at once using gradient descent:

We can then perform gradient descent in Y, V, and W to minimize the objective function given by Eq. 10.

So I would say your option 2 is the correct way to implement what they have done.

There aren't many complexities or non-linearities in there (besides the sigmoid), so you probably aren't that likely to run into the typical optimisation problems associated with neural networks that would call for something like Adam. So as long as it all fits in memory, I'd guess this approach will work.

If it doesn't fit in memory, perhaps there's some way you could devise a minibatch version of the loss to optimise over. I'd also be interested to see whether you could add one or more neural networks to replace some of these conditional probabilities. But that's getting a bit off-topic...
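For example, a minibatch version could keep the single adam() call and simply feed a subset of the observed ratings at each step, so there is still only one 'm'/'v' pair per parameter matrix. The sketch below is just an illustration with made-up names, shapes and data, not code from the paper:

    import numpy as np
    import theano
    import theano.tensor as T
    import lasagne

    floatX = theano.config.floatX
    k, num_users, num_items = 10, 100, 200                       # placeholder sizes
    U = theano.shared(0.1 * np.random.randn(num_users, k).astype(floatX), name='U')
    V = theano.shared(0.1 * np.random.randn(num_items, k).astype(floatX), name='V')

    user_idx, item_idx = T.ivector('user_idx'), T.ivector('item_idx')
    ratings = T.vector('ratings')

    # Loss over whatever subset of ratings is fed in; a single adam() call keeps
    # one 'm'/'v' pair per matrix regardless of how the data is batched.
    # (The 0.01 regularisation weight is a placeholder; you may want to rescale
    # it per minibatch in practice.)
    pred = T.sum(U[user_idx] * V[item_idx], axis=1)
    loss = T.sum((ratings - pred) ** 2) + 0.01 * (T.sum(U ** 2) + T.sum(V ** 2))
    train = theano.function([user_idx, item_idx, ratings], loss,
                            updates=lasagne.updates.adam(loss, [U, V]))

    # Hypothetical training loop over random minibatches of fake observed ratings.
    obs_u = np.random.randint(num_users, size=5000).astype('int32')
    obs_i = np.random.randint(num_items, size=5000).astype('int32')
    obs_r = np.random.rand(5000).astype(floatX)
    for epoch in range(10):
        order = np.random.permutation(len(obs_r))
        for start in range(0, len(obs_r), 128):
            batch = order[start:start + 128]
            train(obs_u[batch], obs_i[batch], obs_r[batch])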

nlml