Machine Learning Equations by Saurabh Verma

Why Regularized Auto-Encoders learn Sparse Representation?


This paper has some great insights to offer on the design of autoencoders. As the title suggests, it theoretically addresses the question: does regularization help autoencoders learn a sparse representation of the data?

Here is the setup: we have an input $x \in \mathbb{R}^n$ which is mapped to the latent space via the autoencoder's encoder $h = s_e(Wx + b)$, where $s_e$ is the encoder activation function, $W \in \mathbb{R}^{m \times n}$ is the weight matrix, $b \in \mathbb{R}^m$ is the encoder bias, and $h \in \mathbb{R}^m$ is the hidden representation, i.e., the outputs of the hidden units.

For the analysis, the paper assumes that the decoder is linear, i.e., $\hat{x} = W^T h$ decodes back the encoded hidden representation, and that the loss is the squared loss function. Therefore, for learning the parameters of the autoencoder, the objective function is:

$$J_{AE} = \mathbb{E}_x\left[\left\|x - W^T s_e(Wx + b)\right\|_2^2\right]$$
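To make the setup concrete, here is a minimal NumPy sketch of this objective; the sigmoid encoder is my illustrative choice, while the tied-weight linear decoder follows the assumption above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_loss(W, b, X, s_e=sigmoid):
    """Squared reconstruction loss with a linear (tied-weight) decoder.

    W : (m, n) weight matrix, b : (m,) encoder bias,
    X : (N, n) batch of data samples.
    """
    A = X @ W.T + b          # pre-activations, shape (N, m)
    H = s_e(A)               # hidden representation h = s_e(Wx + b)
    X_hat = H @ W            # linear decoder: x_hat = W^T h
    return np.mean(np.sum((X - X_hat) ** 2, axis=1)), H
```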

Now we are interested in the sparsity of $h$, the hidden representation.

Besides the above, the paper makes two more assumptions.

Assumption 1: Data is drawn from a distribution for which $\mathbb{E}_x[x] = 0$ and $\mathbb{E}_x[xx^T] = I$, where $I$ is the identity matrix.

Basically, it assumes that the data is whitened, which is reasonable in some cases. There is one more assumption, needed from the analysis point of view for the derivation of the theorems; we will see that later.
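As a quick illustration, data can be brought into this regime with standard ZCA whitening; a minimal sketch (the `eps` jitter guarding against tiny eigenvalues is my own choice):

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """Center X and decorrelate it so that E[x] ~ 0 and E[x x^T] ~ I."""
    Xc = X - X.mean(axis=0)                # enforce E[x] = 0
    cov = Xc.T @ Xc / len(Xc)              # sample covariance
    U, S, _ = np.linalg.svd(cov)           # eigendecomposition of covariance
    W_zca = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return Xc @ W_zca                      # whitened data

X = np.random.randn(1000, 5) @ np.random.randn(5, 5)   # correlated toy data
Xw = zca_whiten(X)
print(np.round(Xw.T @ Xw / len(Xw), 2))                # ~ identity matrix
```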

Finally, for a given data sample $x$, each hidden unit $h_j = s_e(a_j)$ gets activated if its pre-activation $a_j := W_j x + b_j$ is greater than the activation threshold (for example, $0$ in the case of ReLU).

So, the sparsity of $h$ depends upon the values of the pre-activation units $a_j$ (whose values in turn depend upon the input data samples). If the expected value of a pre-activation unit over the data distribution is less than the threshold, then the corresponding hidden unit's expected output is zero.
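In code, this amounts to tracking the per-unit expected activations and the fraction of near-zero outputs; a small sketch (the `tol` cutoff for counting a unit as inactive is an arbitrary illustrative choice):

```python
import numpy as np

def expected_activations(W, b, X, s_e):
    """Per-unit expected output E_x[s_e(W_j x + b_j)], estimated on a batch."""
    A = X @ W.T + b                        # pre-activations a_j = W_j x + b_j
    return s_e(A).mean(axis=0)             # shape (m,), one value per unit

def sparsity(H, tol=1e-2):
    """Fraction of hidden outputs that are (numerically) zero."""
    return np.mean(np.abs(H) < tol)
```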

The idea is to somehow show that regularization encourages the expected value of the pre-activation units to decrease on average in autoencoders. This indirectly induces sparsity in $h$, the hidden representation.

This should hint to us that monotonically increasing activation functions are preferable, because we want the hidden unit value to decrease as the pre-activation value decreases. On top of that, if we consider monotonically increasing activation functions with negative saturation at $0$, i.e., $\lim_{a \to -\infty} s_e(a) = 0$, then we can make sure that a lower average pre-activation value implies higher sparsity. For contrast, consider the $\tanh$ function, whose range is $(-1, 1)$. If we reduce the pre-activation below zero, the hidden unit output moves away from $0$ (toward $-1$), so lowering the pre-activation does not yield sparsity.
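To see the role of negative saturation numerically, the toy comparison below pushes a few pre-activation values downward; sigmoid, ReLU, and softplus saturate at $0$, while tanh saturates at $-1$ (the sample values are mine, purely illustrative):

```python
import numpy as np

a = np.array([2.0, 0.0, -2.0, -6.0])           # decreasing pre-activations
sigmoid  = 1 / (1 + np.exp(-a))                # saturates at 0 as a -> -inf
relu     = np.maximum(0.0, a)                  # exactly 0 for a < 0
softplus = np.log1p(np.exp(a))                 # approaches 0 as a -> -inf
tanh     = np.tanh(a)                          # saturates at -1, NOT 0

for name, h in [("sigmoid", sigmoid), ("relu", relu),
                ("softplus", softplus), ("tanh", tanh)]:
    print(f"{name:8s}", np.round(h, 3))
# Lowering a drives sigmoid/relu/softplus toward 0 (sparsity),
# but drives tanh toward -1, i.e. away from zero.
```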

Let's focus on the analysis now. Assume we want to learn the parameters $(W, b)$ using gradient descent. Then at iteration $t$ of updating, we will have:

$$W^t = W^{t-1} - \eta \nabla_W J_{RAE}, \qquad b^t = b^{t-1} - \eta \nabla_b J_{RAE}$$

where $(W^t, b^t)$ are the parameter values at iteration $t$.

Here the regularized objective function of the autoencoder is expressed as:

$$J_{RAE} = J_{AE} + \lambda R(W, b)$$

where $R(W, b)$ is the regularization term and $\lambda > 0$. Simplifying the main theorem of the paper over here:

Theorem 1: For each iteration $t$, $\mathbb{E}_x[h_j^t] \leq \mathbb{E}_x[h_j^{t-1}]$ holds for every hidden unit $j$ with high probability, if the following conditions are met: $\mathbb{E}_x\!\left[\frac{\partial R}{\partial a_j}\right] > 0$ for all $j$, and the encoding activation function $s_e$ has a bounded first derivative.

The above theorem shows that updating $(W, b)$ along the negative gradient of $J_{RAE}$ results in $\mathbb{E}_x[h_j^t] \leq \mathbb{E}_x[h_j^{t-1}]$ with high probability if a particular regularization condition is met (plus the activation function condition). This means certain regularization terms explicitly encourage sparsity in the hidden representation.
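As a quick empirical illustration of this effect (my toy sketch, not the paper's experiment), the snippet below trains the tied-weight autoencoder from above with an L1-style penalty $R = \mathbb{E}_x\big[\sum_j h_j\big]$ (an instance of the "first form" in the corollary below) and tracks the mean hidden activation; finite-difference gradients keep the code short, and the sizes, $\lambda$, and learning rate are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def j_rae(theta, X, lam, m, n):
    """J_RAE = J_AE + lam * R, with R = E_x[sum_j h_j] (f = identity)."""
    W = theta[:m * n].reshape(m, n)
    b = theta[m * n:]
    H = sigmoid(X @ W.T + b)
    X_hat = H @ W                       # linear, tied-weight decoder
    return np.mean(np.sum((X - X_hat) ** 2, axis=1)) + lam * H.sum(axis=1).mean()

def num_grad(f, theta, eps=1e-5):
    """Central-difference gradient; slow but dependency-free."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

n, m, lam, lr = 4, 6, 0.3, 0.05
X = rng.standard_normal((200, n))       # stand-in for whitened data
theta = 0.1 * rng.standard_normal(m * n + m)
for t in range(201):
    theta -= lr * num_grad(lambda th: j_rae(th, X, lam, m, n), theta)
    if t % 50 == 0:
        W, b = theta[:m * n].reshape(m, n), theta[m * n:]
        print(t, round(float(sigmoid(X @ W.T + b).mean()), 3))
```

On this toy data the printed mean activation should drift downward as training proceeds, which is exactly the sparsification effect the theorem predicts.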

Corollary 1: Theorem 1 holds for two special cases (both forms are sketched in code below):
1) the activation function $s_e$ is non-decreasing, and the regularization term has the form $R = \mathbb{E}_x\big[\sum_{j=1}^{m} f(h_j)\big]$ for some monotonically increasing function $f$;
2) the activation function $s_e$ is convex as well as non-decreasing, and the regularization term has the form $R = \mathbb{E}_x\big[\sum_{j=1}^{m} \alpha_j \, s_e'(a_j)^{2n}\big]$, where $\alpha_j \geq 0$ and $n$ is a natural number.
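In code, the two regularizer forms might look as follows (a sketch under my reading of the corollary: `f` is any monotonically increasing function, `alpha` a vector of non-negative per-unit weights, and `s_e_prime` the first derivative of the activation):

```python
import numpy as np

def r_first_form(H, f=lambda h: h):
    """First form: R = E_x[ sum_j f(h_j) ] with f monotonically increasing."""
    return f(H).sum(axis=1).mean()

def r_second_form(A, s_e_prime, alpha, n=1):
    """Second form: R = E_x[ sum_j alpha_j * (s_e'(a_j))^(2n) ]."""
    return (alpha * s_e_prime(A) ** (2 * n)).sum(axis=1).mean()
```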

We are ready to make some remarks about activation functions used in practice!

ReLU: It is convex and non-decreasing, so it meets the activation condition of both corollary cases. However, it can only be used with regularizers of the first form, not the second (its second derivative does not exist, so the gradient of a second-form regularizer is undefined). The advantage of ReLU is that it enforces hard zeros in the learned representations.

Softplus: It satisfies both corollary cases and encourages sparsity with both suggested regularization forms, but it does not produce hard zeros.

Sigmoid: It satisfies only the first case (sigmoid is not convex). Hence sigmoid is not guaranteed to lead to sparsity when used with the second regularization form.

Maxout and Tanh: These do not satisfy the negative saturation property at $0$ (e.g., $\tanh$ saturates at $-1$) and hence may or may not lead to sparsity.
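A quick numerical check of the hard-zeros point, on a toy batch of pre-activations biased toward negative values (all sizes are mine, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 50)) - 1.0      # pre-activations, mostly negative

relu_h     = np.maximum(0.0, A)
softplus_h = np.log1p(np.exp(A))
sigmoid_h  = 1 / (1 + np.exp(-A))

for name, H in [("relu", relu_h), ("softplus", softplus_h),
                ("sigmoid", sigmoid_h)]:
    print(f"{name:8s} exact zeros: {np.mean(H == 0.0):.2%}")
# Only ReLU yields exact zeros; softplus/sigmoid outputs are merely small.
```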

Final remarks about regularized autoencoders used in practice!

  1. De-noising Auto-Encoder (DAE) has regularization of the second form.
  2. Contractive Auto-Encoder (CAE), including its higher-order variants, has regularization of the second form (see the sketch after this list).
  3. Marginalized De-noising Auto-Encoder (mDAE) has regularization of the second form.
  4. Sparse Auto-Encoder (SAE) has regularization of the first form.
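For instance, the first-order contractive penalty $\|\partial h / \partial x\|_F^2$ expands, for $h = s_e(Wx + b)$, into $\sum_j \|W_j\|^2 \, s_e'(a_j)^2$, i.e., the second form with $n = 1$ and $\alpha_j = \|W_j\|^2$. A minimal sketch with a sigmoid encoder:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(W, b, X):
    """CAE penalty E_x ||dh/dx||_F^2 = E_x[ sum_j ||W_j||^2 * s_e'(a_j)^2 ]."""
    S = sigmoid(X @ W.T + b)                # hidden activations
    ds = S * (1.0 - S)                      # sigmoid derivative s_e'(a)
    return float(np.mean((ds ** 2) @ (W ** 2).sum(axis=1)))
```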

Thus, all of the above autoencoders theoretically encourage sparsity, if chosen with an appropriate activation function.

References:

  1. Arpit, D., Zhou, Y., Ngo, H., & Govindaraju, V. (2016). Why Regularized Auto-Encoders learn Sparse Representation? In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016). [PDF]