
I already have my data prepared as follows:

p1=input1 %load of today current hour
p2=input2 %load of today past one hour
p3=input3 %load of today past two hours
a1=output %load of next day current hour

I have the following code below:

%Input Set 1 For Weekday Load(d+1,t)
%(d,t),(d,t-1), (d,t-2)

L=xlsread('input_set1_weekday.xlsx',1); %2011

k=1;
N = size(L,1); %number of samples in the year

for a=5:2:N-48 % L load for 2011
    P(1,k)= L(a,1);    %(d,t)   - load of today, current hour
    P(2,k)= L(a-2,1);  %(d,t-1) - load of today, past one hour
    P(3,k)= L(a-4,1);  %(d,t-2) - load of today, past two hours
    P(4,k)= L(a+48,1); %(d+1,t) - load of next day, current hour
    k=k+1;
end

I have my data arranged so that in every column of P, the first three rows (p1, p2, p3) are my predictor variables and the fourth row (a1) is my response variable.

How do I now fit a linear model to this data to check the performance of my predictions? By the way, this is an electrical load forecasting model.

My other doubt is that in the examples shown by most sources, the last column of the data is used as the response variable, and this is the part I'm struggling with.


1 Answer


fitlm will be able to do this for you quite nicely. You use fitlm to train a linear regression model, providing it the predictors as well as the responses. Once you do this, you can then use predict to predict new responses based on the new predictors you put in.

The basic way for you to call this is:

lmModel = fitlm(X, y, 'linear', 'RobustOpts', 'on');

X is a data matrix where each column is a predictor and each row is an observation, so you would have to transpose your matrix before running this function. Specifically, you would use P(1:3,:).' as you only want the first three rows (now columns) of your data. y contains the output value for each observation and is a column vector with the same number of rows as you have observations.

Regarding your comment about using the "last" column as the response vector: you don't have to do this at all. You specify your response vector in a completely separate input variable, y. As such, your a1 serves here, while your predictors and observations are stored in X. You can still place your response vector as a column in your matrix; you would just have to subset it accordingly.
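
For instance, a minimal sketch of that subsetting, assuming your 4 x N matrix P holds the response in its fourth row as the loop above builds it (X and y here are just illustrative names):

X = P(1:3,:).'; %// predictors: one observation per row, one predictor per column
y = P(4,:).';   %// response subset out of the same matrix, as a column vector; passing a1(:) directly works equally well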

As such, y would be your a1 variable; make sure it's a column vector, which you can guarantee with a1(:). The 'linear' flag specifies linear regression, but that is the default flag anyway. 'RobustOpts' is recommended so that you perform robust linear regression. For your case, you would call fitlm this way:

lmModel = fitlm(P(1:3,:).', a1(:), 'linear', 'RobustOpts', 'on');

Now to predict new responses, you would do:

ypred = predict(lmModel, Xnew);

Xnew would be your new observations, following the same format as X. It has to have the same number of columns as X, but it can have as many rows as you want. The output ypred will give you the predicted response for each observation in Xnew.

As an example, let's use a dataset that is built into MATLAB, split it into a training and a test set, fit a model with the training set, then feed in the test set and see what the predicted responses are. Let's split the data in a 75% / 25% ratio. We will use the carsmall dataset, which contains 100 observations for various cars with descriptors such as Weight, Displacement, Model... typically used to describe cars. We will use Weight, Cylinders and Acceleration as the predictor variables, and try to predict the miles per gallon, MPG, as our outcome. Once we do this, let's calculate the differences between the predicted values and the true values to compare them. As such:

load carsmall; %// Load in dataset

%// Build predictors and outcome
X = [Weight Cylinders Acceleration];
y = MPG;

%// Set seed for reproducibility
rng(1234);

%// Generate training and test data sets
%// Randomly select 75 observations for the training
%// dataset.  First generate the indices to select the data
indTrain = randperm(100, 75);

%// The above may generate an error if you have anything below R2012a
%// As such, try this if the above doesn't work
%//indTrain = randperm(100);
%//indTrain = indTrain(1:75);

%// Get those indices that haven't been selected as the test dataset
indTest = 1 : 100;
indTest(indTrain) = [];

%// Now build our test and training data
trainX = X(indTrain, :);
trainy = y(indTrain);
testX = X(indTest, :);
testy = y(indTest);

%// Fit linear model
lmModel = fitlm(trainX, trainy, 'linear', 'RobustOpts', 'on');

%// Now predict
ypred = predict(lmModel, testX);

%// Show differences between predicted and true test output
diffPredict = abs(ypred - testy);

This is what you see when you display the linear model:

lmModel = 


Linear regression model (robust fit):
    y ~ 1 + x1 + x2 + x3

Estimated Coefficients:
                Estimate        SE         tStat       pValue  
               __________    _________    _______    __________

(Intercept)        52.495       3.7425     14.027    1.7839e-21
x1             -0.0047557    0.0011591    -4.1031    0.00011432
x2                -2.0326      0.60512     -3.359     0.0013029
x3               -0.26011       0.1666    -1.5613       0.12323


Number of observations: 70, Error degrees of freedom: 66
Root Mean Squared Error: 3.64
R-squared: 0.788,  Adjusted R-Squared 0.778
F-statistic vs. constant model: 81.7, p-value = 3.54e-22

This all comes from statistical analysis, but for a novice, what matters are the p-values for each of our predictors. The smaller the p-value, the better suited the predictor is to your model. You can see that the first two predictors, Weight and Cylinders, do a good job of determining MPG. Acceleration... not so much. What this means is that Acceleration is not a meaningful predictor to use, so you should probably drop it. In fact, if you were to remove this predictor and retrain your model, you would most likely see that the predicted values closely match those you got when Acceleration was included.

This is a truly bastardized version of interpreting p-values, so I defer you to an actual regression or statistics course for more details.
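
If you want to try that yourself, here is a minimal sketch that reuses trainX and testX from above (the names lmModelNoAccel and ypredNoAccel are just illustrative); Acceleration is the third column:

%// Retrain without Acceleration and predict again
lmModelNoAccel = fitlm(trainX(:, 1:2), trainy, 'linear', 'RobustOpts', 'on');
ypredNoAccel = predict(lmModelNoAccel, testX(:, 1:2));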


This is what we predict the values to be given our test set, with the true values beside them:

>> [ypred testy]

ans =

17.0324   18.0000
12.9886   15.0000
13.1869   14.0000
14.1885       NaN
16.9899   14.0000
29.1824   24.0000
23.0753   18.0000
28.6148   28.0000
28.2572   25.0000
29.0365   26.0000
20.5819   22.0000
18.3324   20.0000
20.4845   17.5000
22.3334   19.0000
12.2569   16.5000
13.9280   13.0000
14.7350   13.0000
26.6757   27.0000
30.9686   36.0000
30.4179   31.0000
29.7588   36.0000
30.6631   38.0000
28.2995   26.0000
22.9933   22.0000
28.0751   32.0000

The fourth actual output value from the test data set is NaN, which denotes that the value is missing. However, when we run the observation corresponding to this output through our linear model, it predicts a value anyway, which is to be expected: the model was trained on the other observations, so it naturally draws on them to produce a prediction for this one.
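
If you later want to exclude those missing entries from any error calculation, a minimal sketch using isnan (valid and diffPredictClean are just illustrative names) would be:

%// Keep only the test observations whose true response is present
valid = ~isnan(testy);
diffPredictClean = abs(ypred(valid) - testy(valid));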

When we compute the difference between these two, we get:

diffPredict =

 0.9676
 2.0114
 0.8131
    NaN
 2.9899
 5.1824
 5.0753
 0.6148
 3.2572
 3.0365
 1.4181
 1.6676
 2.9845
 3.3334
 4.2431
 0.9280
 1.7350
 0.3243
 5.0314
 0.5821
 6.2412
 7.3369
 2.2995
 0.9933
 3.9249

As you can see, there are some instances where the prediction was quite close, and others where it was far from the truth... that's the crux of any prediction algorithm, really. You'll have to play around with which predictors you use, as well as with the training options. Have a look at the fitlm documentation for more details on what you can tweak.
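
As one hypothetical variation, you could swap the 'linear' model specification for 'interactions', which adds pairwise products of the predictors; this is a sketch, not necessarily a better model for your data:

%// Same training data, but with pairwise interaction terms added
lmModel2 = fitlm(trainX, trainy, 'interactions', 'RobustOpts', 'on');
ypred2 = predict(lmModel2, testX);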


Edit - July 30th, 2014

As you don't have fitlm, you can easily use LinearModel.fit instead. You call it with the same inputs as fitlm. As such:

lmModel = LinearModel.fit(trainX, trainy, 'linear', 'RobustOpts', 'on');

This should give you exactly the same results. predict should exist pre-R2014a, so that should be available to you.
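
If the same script has to run on both older and newer MATLAB versions, one hedged sketch is to branch on whichever function is available:

%// Prefer fitlm when present, otherwise fall back to LinearModel.fit
if exist('fitlm', 'file')
    lmModel = fitlm(trainX, trainy, 'linear', 'RobustOpts', 'on');
else
    lmModel = LinearModel.fit(trainX, trainy, 'linear', 'RobustOpts', 'on');
end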


Good luck!

  • Thanks a lot, appreciate it. However, I'm using 2013a, which doesn't have the fitlm function. I'm using LinearModel.fit, which I think does the same thing? – user3800114 Jul 30 '14 at 06:11
  • @user3800114 - `LinearModel.fit` will work, yes. I'll modify my post. – rayryeng Jul 30 '14 at 07:02
  • @user3800114 - Done. `LinearModel.fit` should do the same as `fitlm`, but there is a warning that this function will be removed in later versions, so whenever you upgrade your MATLAB (if you do), use `fitlm`. Good luck! – rayryeng Jul 30 '14 at 07:27
  • I got my original and predicted data plotted. Looking at the p-values, mine were quite small, which indicates the predictor variables are pretty significant; however, the RMSE remains very high. How would I further improve it, and perhaps remove some of the outlier values? Thanks – user3800114 Jul 30 '14 at 15:52
  • I would do some sort of pre-processing first to ensure that your predictors are all standardized. Take a look at this post: http://stackoverflow.com/questions/10119913/pca-first-or-normalization-first . Normalize your data, then apply PCA on it, then train using this transformed dataset. Also check: http://matlabdatamining.blogspot.ca/2010/02/principal-components-analysis.html and http://matlabdatamining.blogspot.ca/2010/02/putting-pca-to-work.html – rayryeng Jul 30 '14 at 16:00
  • One other thing I would suggest is to remove missing values. Check your training and test data and see if any of the outcomes are `NaN`. You can then filter these by using `isnan`. This may help, but I'm not sure as I haven't tried. – rayryeng Jul 30 '14 at 16:12
  • Thanks. How would I write a MAPE calculation script to check my performance? – user3800114 Aug 01 '14 at 01:38
  • One more question: now that I can do a linear model fit, how would I write my arguments to fit a non-linear model? Like NonlmModel = NonLinearModel.fit(trainX, trainy, 'linear', 'RobustOpts', 'on'); – user3800114 Aug 01 '14 at 02:15
  • @user3800114 http://www.mathworks.com/help/stats/examples/weighted-nonlinear-regression.html – rayryeng Aug 01 '14 at 02:45
  • Hi ray, I tried but got stuck on the non-linear model. Could you show how to fit the non-linear model to the dataset I'm using? – user3800114 Aug 04 '14 at 01:49
  • @user3800114 - I'll leave this up to you. The example that I showed in the above link should be enough to get you started. Good luck! – rayryeng Aug 04 '14 at 01:50
  • Could you help me plot the regression line for the linear model with 3 variables? Here is my question: http://stackoverflow.com/questions/33387195/how-to-plot-the-data-for-linear-model-with-3-variables-in-matlab – Leeway Oct 29 '15 at 00:44