
I have been looking at autoencoders and have been wondering whether to use tied weights or not. I intend on stacking them as a pretraining step and then using their hidden representations to feed a NN.

Using untied weights it would look like:

f(x) = σ2(b2 + W2 σ1(b1 + W1 x))

Using tied weights it would look like:

f(x) = σ2(b2 + W1ᵀ σ1(b1 + W1 x))
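In code, the two variants differ only in whether the decoder reuses W1 transposed. A minimal numpy sketch (the sigmoid activations and layer sizes are just illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 6, 3
x = rng.normal(size=n_in)

# Encoder parameters (shared by both variants) and an independent decoder W2
W1 = rng.normal(size=(n_hid, n_in))
b1 = np.zeros(n_hid)
W2 = rng.normal(size=(n_in, n_hid))
b2 = np.zeros(n_in)

h = sigmoid(b1 + W1 @ x)        # hidden representation σ1(b1 + W1 x)

untied = sigmoid(b2 + W2 @ h)   # decoder has its own weight matrix
tied = sigmoid(b2 + W1.T @ h)   # decoder reuses the encoder weights, transposed
```

Training the tied version updates W1 from both the encoder and decoder sides, which is where the regularising effect discussed below comes from.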

From a very simplistic view, could one say that tying the weights ensures that the encoder part is generating the best representation given the architecture, whereas if the weights were independent the decoder could effectively take a non-optimal representation and still decode it?

I ask because if the decoder is where the "magic" occurs, and I intend to only use the encoder to drive my NN, wouldn't that be problematic?

Ameet Deshpande
Paul O

2 Answers


Autoencoders with tied weights have some important advantages:

  1. They're easier to learn, since there are fewer parameters to fit.
  2. In the linear case they're equivalent to PCA - this may lead to a more geometrically adequate coding.
  3. Tied weights are a sort of regularisation.

But of course they're not perfect: they may not be optimal when your data come from a highly nonlinear manifold. Depending on the size of your data I would try both approaches - with tied weights and without - if possible.
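Point 2 can be checked in a few lines of numpy: with linear activations, no biases, and tied weights whose rows are orthonormal, the autoencoder computes an orthogonal projection of the input - exactly what PCA does when those rows are the top principal directions. This sketch uses a random orthonormal W rather than directions fitted to data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Build an encoder W (2 x 5) with orthonormal rows via QR decomposition
A = rng.normal(size=(2, 5))
Q, _ = np.linalg.qr(A.T)   # Q is 5 x 2 with orthonormal columns
W = Q.T                    # 2 x 5 with orthonormal rows

x = rng.normal(size=5)
code = W @ x               # linear encode
recon = W.T @ code         # tied-weight linear decode

# The round trip is the orthogonal projection P = Wᵀ W onto the row space of W
P = W.T @ W
```

The projection is idempotent (P @ P == P), which is the defining property of projecting onto a subspace, as PCA does with its principal subspace.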

UPDATE:

You also asked why a representation that comes from an autoencoder with tied weights might be better than one without. Of course it's not the case that such a representation is always better, but if the reconstruction error is sensible, then the different units in the coding layer represent something that might be considered generators of perpendicular features which explain most of the variance in the data (exactly as PCA does). This is why such a representation can be pretty useful in a further phase of learning.

Marcin Możejko
  • ty for the quick response. I understand your answer and did read your "try both approaches" comment but from a theoretical pov how could the untied/independent weights give a superior answer as you end up throwing the decoder away? – Paul O Apr 27 '16 at 14:41

The main advantage is:

  1. Fewer parameters, so better generalization (we reuse the transpose of the original weights at the next layer), versus more parameters, which can lead to overfitting.
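The parameter saving is easy to make concrete. With illustrative sizes (784 inputs, 128 hidden units - not from the question), tying removes one full weight matrix:

```python
n_in, n_hid = 784, 128   # illustrative layer sizes

# Untied: encoder W1 (n_hid x n_in), decoder W2 (n_in x n_hid), both biases
untied_params = n_hid * n_in + n_in * n_hid + n_hid + n_in

# Tied: the decoder reuses W1 transposed, so only one weight matrix is learned
tied_params = n_hid * n_in + n_hid + n_in

saving = untied_params - tied_params   # equals n_hid * n_in, one full matrix
```

Roughly half the weights disappear, which is exactly the regularising constraint the accepted answer describes.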
Ravi