I have a data frame with two columns:

var_1<-seq_len(252)
var_2<-runif(252)*1000

my_new_df<-data.frame(var_1,var_2)
names(my_new_df)<-c("Time_values","Count")

train_poly_data<-my_new_df[1:150,c("Time_values","Count")] # training data set
valid_poly_data<-my_new_df[151:200,c("Time_values","Count")] # validation data set

test_poly_data<-my_new_df[201:252,c("Time_values","Count")] # test data set

# fit a polynomial regression model of degree 20
poly_tr<-lm(train_poly_data$Count ~ poly(train_poly_data$Time_values,degree=20,raw = TRUE))
summary(poly_tr)

# predict on the validation set
my_pred<-predict(poly_tr,valid_poly_data)

# this produces the following warnings
Warning messages:
1: 'newdata' had 50 rows but variables found have 150 rows 
2: In predict.lm(poly_tr, valid_poly_data) :
  prediction from a rank-deficient fit may be misleading

Here is what I need to do:

I need to split the data frame into training, validation, and test sets. Next, I want to fit a polynomial regression on the training data and validate it on the validation data.

But I keep getting these warnings. How can I resolve them? I would also like to find the optimal degree of the polynomial, to see whether my arbitrarily chosen degree of 20 is reasonable.

Any suggestions or help pointing out my mistake are welcome.

How do I fix this warning? I understand that it is raised because there are 150 rows in the training set and 50 in the validation set.

Prradep
Q007
  • where is your predict.lm code? – Sandipan Dey Nov 28 '16 at 07:04
  • my_pred<-predict(poly_tr,valid_poly_data) Warning messages: 1: 'newdata' had 50 rows but variables found have 150 rows 2: In predict.lm(poly_tr, valid_poly_data) : prediction from a rank-deficient fit may be misleading – Q007 Nov 28 '16 at 07:07
  • Sorry I forgot to paste it – Q007 Nov 28 '16 at 07:07
  • Couldn't find the reason for the first warning, but the second one is due to: ``if (p < ncol(X) && !(missing(newdata) || is.null(newdata))); warning("prediction from a rank-deficient fit may be misleading")`` (in ``predict.lm``) where ``p <- object$rank`` and ``X <- model.matrix(Terms, m, contrasts.arg = object$contrasts)``. The warning disappears for me when using just ``degree=10`` which does not seem a solution for you. – Phann Nov 28 '16 at 07:34
  • Phann, thanks for your efforts and the insight – Q007 Nov 28 '16 at 08:45
  • Zheyuan Li, this is NOT a duplicate question. I am sorry but you failed to answer the problem. The example that you have suggested has a bit of overlap but IS NOT the proposed solution to my problem. – Q007 Nov 29 '16 at 01:00

1 Answer

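The first warning goes away once you convert the validation data to the same format as the training data before you run predict, so that the training and validation data have exactly the same set of regressors (predictor variables). In the original call the formula refers to train_poly_data$Count and train_poly_data$Time_values directly, so predict cannot substitute the columns of newdata and evaluates the 150 training rows instead.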
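
As an alternative way to see the first point, here is a minimal sketch that references the columns by name and passes the data frame through the data argument, so that predict can look up the same names in newdata. The seed, the short names df, train, valid and fit, and the degree of 3 are assumptions chosen for illustration, not part of the original post.

```r
# Illustrative data with the same shape as in the question (seed is arbitrary).
set.seed(1)
df <- data.frame(Time_values = 1:252, Count = runif(252) * 1000)
train <- df[1:150, ]
valid <- df[151:200, ]

# Refer to columns by bare name and supply the data frame via `data`.
# predict() can then find Time_values in `newdata`.
fit <- lm(Count ~ poly(Time_values, degree = 3, raw = TRUE), data = train)

pred <- predict(fit, newdata = valid)
length(pred)  # one prediction per validation row
```

With this form the number of predictions matches the number of rows in the validation set, which is what the first warning was complaining about.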

The second warning will remain: a raw polynomial of degree 20 produces columns that are numerically almost collinear, so the fit is rank-deficient. Such a high degree is also very likely to overfit the training data, so the model may not generalize or be useful.
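
One way to confirm the rank deficiency directly is to compare the rank reported by lm with the number of coefficients. The sketch below assumes training data shaped like the question's (Time_values from 1 to 150); the seed and the names train, fit20, n_coef, n_rank and n_dropped are illustrative only.

```r
# Illustrative training data mirroring the question (seed is arbitrary).
set.seed(1)
train <- data.frame(Time_values = 1:150, Count = runif(150) * 1000)

fit20 <- lm(Count ~ poly(Time_values, degree = 20, raw = TRUE), data = train)

n_coef    <- length(coef(fit20))       # intercept + 20 polynomial terms
n_rank    <- fit20$rank                # columns lm() could actually estimate
n_dropped <- sum(is.na(coef(fit20)))   # dropped (aliased) terms show up as NA

rank_deficient <- n_rank < n_coef
rank_deficient
```

Any coefficient shown as NA in summary(fit20) corresponds to a term that lm dropped, which is what triggers the warning in predict.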

To reduce the overfitting and eliminate the rank deficiency, fit a lower-degree polynomial instead, in which case both warnings go away.

Try this to get rid of both the warnings:

my_new_df<-data.frame(var_1,var_2)
names(my_new_df)<-c("Time_values","Count") 

n <- 10 # lower degree polynomial
# first generate all the polynomial regressors on the entire data
my_new_df <- cbind.data.frame(my_new_df[-1], poly(my_new_df$Time_values, degree=n, raw=TRUE))
names(my_new_df)[-1] <- paste0('X', names(my_new_df)[-1])

train_poly_data<-my_new_df[1:150,] # training data set
valid_poly_data<-my_new_df[151:200,] # validation data set

test_poly_data<-my_new_df[201:252,] # test data set

# fit a polynomial regression model of degree n
poly_tr<-lm(Count ~ ., train_poly_data)
summary(poly_tr)
pred <- predict(poly_tr, newdata=valid_poly_data)
pred


 # 151          152          153          154          155          156           
 # 796.5672     982.6862    1219.7434    1517.9844    1889.2235    2347.0258 
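
The question also asks how to pick the degree. One common approach, sketched here rather than prescribed, is to fit several candidate degrees on the training set and keep the one with the lowest error on the validation set. The candidate range 1:5, the use of RMSE, the seed, and the names degrees, rmse and best_degree are all assumptions made for this example.

```r
# Illustrative data with the same shape as in the question (seed is arbitrary).
set.seed(42)
df <- data.frame(Time_values = 1:252, Count = runif(252) * 1000)
train <- df[1:150, ]
valid <- df[151:200, ]

degrees <- 1:5  # candidate degrees (assumed range)

# Validation RMSE for each candidate degree.
rmse <- sapply(degrees, function(d) {
  fit  <- lm(Count ~ poly(Time_values, degree = d, raw = TRUE), data = train)
  pred <- predict(fit, newdata = valid)
  sqrt(mean((valid$Count - pred)^2))
})

best_degree <- degrees[which.min(rmse)]
data.frame(degree = degrees, rmse = rmse)
```

Because the validation rows come after the training rows in time, the model is extrapolating, so higher degrees tend to be penalized heavily; the test set can then be used once to check the chosen degree.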
Sandipan Dey