
I have created an autoencoder neural network in MATLAB. The inputs at the first layer are quite large, and I have to reconstruct them through the network's output layer. I cannot use the large inputs as they are, so I map them to the range [0, 1] using MATLAB's sigmf function. However, it gives me a value of 1.000000 for all the large inputs. I have tried changing the display format, but it does not help.
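
For example (a small illustration on made-up values, assuming the standard logistic parameters [1 0] for sigmf):

x = [10 100 1000 10000];
sigmf(x, [1 0])   % every element displays as 1.0000; the larger ones are exactly 1 in double precision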

Is there a workaround for using large values with my autoencoder?

Sasha
  • Have you considered using a different non-linearity at the output of the layer, e.g. changing `sigmf` to something else, such as `log`? – Shai Jul 14 '14 at 09:08
  • Since I am using the sigmoid function within the network, the output values cannot exceed 1. Using log would give me values larger than that. – Sasha Jul 14 '14 at 09:12
  • I understand that your problem is that the inputs are too large, and you are using sigmf to preprocess your inputs before using the autoencoder, right? – Pablo EM Jul 14 '14 at 09:45
  • Yes. The inputs are not too large, but large enough for the sigmoid function to produce the same mapping for multiple values. – Sasha Jul 14 '14 at 10:09
  • @Sasha Won't your inputs be multiplied by weights before the network applies the sigmoid function? So the weights can reduce the values to below one. It sounds like you're looking to normalize your data so that all your inputs have similar magnitude ranges; you can do this by subtracting the mean for each variable and then dividing by the standard deviation. This process is often called standardizing the inputs. – Dan Jul 14 '14 at 10:26
  • @Dan Yes, they would be. But the actual inputs are large, and the auto-encoder would try to reconstruct them, right? So how can it do so when it uses the sigmoid activation? – Sasha Jul 15 '14 at 06:20
  • @Sasha Sounds like the sigmoid function is not an appropriate activation function then. But if you really want to use it, Pablo's answer shows you how to normalize your data to do so. Just don't forget to denormalize the output by applying the same original parameters (in this case max and min) in the exact reverse of the process. – Dan Jul 15 '14 at 06:36
  • @Dan Thanks. I would appreciate it if you could suggest a better activation function. – Sasha Jul 15 '14 at 06:38
  • @Sasha That's a question for Cross Validated. Personally I would just pick a few and choose based on a validation set. But the people at Cross Validated should have better intuition in that respect. Shai provided some in his answer though... – Dan Jul 15 '14 at 06:40

2 Answers


The process of converting your inputs to the range [0,1] is called normalization; however, as you noticed, the sigmf function is not adequate for this task. This link may be useful to you.

Suppose that your inputs are given by a matrix of N rows and M columns, where each row represents an input pattern and each column is a feature. If your first column is:

vec =

   -0.1941
   -2.1384
   -0.8396
    1.3546
   -1.0722

Then you can convert it to the range [0,1] using:

%# get max and min
maxVec = max(vec);
minVec = min(vec);

%# normalize to 0...1
vecNormalized = ((vec-minVec)./(maxVec-minVec))

vecNormalized =

    0.5566
         0
    0.3718
    1.0000
    0.3052
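
If you later need values (for example, the network's reconstructions) back on the original scale, invert the mapping with the same parameters, as Dan points out in the comments. A minimal sketch, reusing minVec and maxVec from above:

%# map normalized values back to the original scale, reusing the SAME
%# minVec and maxVec that were computed from the training data
vecDenormalized = vecNormalized.*(maxVec-minVec) + minVec;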

As @Dan indicates in the comments, another option is to standardize the data. The goal of this process is to scale the inputs to have mean 0 and variance 1. In this case, you need to subtract the mean value of the column and divide by the standard deviation:

meanVec = mean(vec);
stdVec = std(vec);

vecStandardized = (vec-meanVec)./stdVec

vecStandardized =

    0.2981
   -1.2121
   -0.2032
    1.5011
   -0.3839
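
For the full N-by-M input matrix, apply the same idea column by column (one set of parameters per feature), and reuse the parameters learned on the training data when you normalize test data. A sketch, assuming X holds your input patterns:

%# per-column (per-feature) min-max normalization of an N-by-M matrix X
minX = min(X, [], 1);   %# 1-by-M vector of column minima
maxX = max(X, [], 1);   %# 1-by-M vector of column maxima
Xnormalized = (X - repmat(minX, size(X,1), 1)) ./ repmat(maxX - minX, size(X,1), 1);
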
Pablo EM
  • Please include a brief description of the solution in that link in your actual answer - otherwise I agree that this is the correct way to go. – Dan Jul 14 '14 at 10:23
  • @Dan Applying a different normalization/standardization to each sample might be a bit problematic: you may change the underlying distribution of the samples. It may be better to apply the **same** normalization to all (train as well as test) inputs. You may learn a **global** `minVal` and a **global** `maxVal` to be used to normalize ALL inputs, training and test. – Shai Jul 14 '14 at 11:50
  • "standartization" does not seem to be the best solution here as "standarize" input does not lie in the range [0..1] but rather around 0 - it is not exactly compatible with `sigmf` range. – Shai Jul 14 '14 at 11:52
  • @Shai I disagree on both points. I personally think that the OP's need for the data to be in [0,1] is erroneous, which is why I mentioned standardization in the comment to the question. As for the global norm, I've never seen it done that way but would be interested to read more about it. But as far as I know, scaling your inputs should be done separately to prevent those with higher orders of magnitude from rendering those that are small irrelevant. I don't think that linear scaling would affect the dependencies of the distributions, but I could be wrong. I've just never seen it done that way before. – Dan Jul 14 '14 at 13:00
  • @Dan AFAIK when you PCA your data you use the same mean and coefficients to normalize both train and test data; I don't see why scaling should be any different. Applying a global shift and scale, you maintain the (high-dimensional) geometrical properties of the data points. However, when applying a different shift/scale to each point, you effectively make them change location with respect to each other. I don't think these local changes are a good thing in learning. Again, this is my opinion. – Shai Jul 14 '14 at 13:41
  • @Shai You should definitely apply the same normalization parameters to the test set as to the training set, but that is per input. Each input variable, however, should be normalized independently of the other input variables. This is very standard practice in learning algorithms. Actually, I think there is an implicit assumption that all the input variables are independent; statisticians usually refer to these as the independent variables. If you want to model the dependencies, you often have to include an interaction term (such as X1*X2) as a separate new input variable. – Dan Jul 14 '14 at 13:48
  • @Dan I suspect we have a slight misunderstanding. What do you mean by "per input"? Suppose we have `n` samples of dimension `p`. Is it "per training sample", i.e. you have `n` different normalization parameters? Or is it "per input dimension", in which case you have `p` different normalization parameters? – Shai Jul 14 '14 at 13:54
  • @Shai I have been referring to "per input dimension" – Dan Jul 14 '14 at 14:01
  • @Dan in that case we are totally in agreement. I'm sorry if I was not clear in my previous remarks. – Shai Jul 14 '14 at 14:27

Before I give you my answer, let's think a bit about the rationale behind an auto-encoder (AE):
The purpose of an auto-encoder is to learn, in an unsupervised manner, something about the underlying structure of the input data. How does an AE achieve this goal? If it manages to reconstruct the input signal from its output signal (which is usually of lower dimension), it means that it did not lose information and effectively managed to learn a more compact representation.

In most examples, it is assumed, for simplicity, that both the input signal and the output signal lie in the range [0..1]. Therefore, the same non-linearity (sigmf) is applied both for obtaining the output signal and for reconstructing the inputs back from the outputs.
Something like

output = sigmf( W*input + b, [1 0] );               % compute output signal (standard logistic parameters)
reconstruct = sigmf( W'*output + b_prime, [1 0] );  % notice the different constant b_prime

Then the AE learning stage tries to minimize the reconstruction error || input - reconstruct ||.

However, who said the reconstruction non-linearity must be identical to the one used for computing the output?

In your case, the assumption that the inputs range in [0..1] does not hold. Therefore, it seems that you need to use a different non-linearity for the reconstruction. You should pick one that agrees with the actual range of your inputs.

If, for example, your inputs range in (0..inf), you may consider using exp or ().^2 as the reconstruction non-linearity. You may use polynomials of various degrees, log, or whatever function you think may fit the spread of your input data.
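
A minimal sketch of this idea, assuming strictly positive inputs and the same W, b, b_prime as above (exp is just one possible choice):

output = sigmf( W*input + b, [1 0] );      % hidden code stays in [0,1]
reconstruct = exp( W'*output + b_prime );  % reconstruction can reach any value in (0, inf)
err = norm( input - reconstruct );         % reconstruction error to minimize during training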


Disclaimer: I never actually encountered such a case and have not seen this type of solution in literature. However, I believe it makes sense and at least worth trying.

Shai