
I'm building a regression model with several date and numeric variables. I ran a quick check on one of the date variables

    lm.fit = lm(label ~ Firstday, data = rawdata)
    summary(lm.fit)$r.squared   

to gauge its predictive influence on the model. It accounted for 41% of the variance. I then tried to convert the date to numeric so that the variable is easier to work with, using the command

    as.numeric(as.POSIXct(rawdata$Firstday, format = "%Y-%m-%d"))

Doing this reduced the explained variance (R-squared) to 10%, which is not what I want. What am I doing wrong, and how should I go about it?

I've looked at https://stats.stackexchange.com/questions/65900/does-it-make-sense-to-use-a-date-variable-in-a-regression but the answer is not clear to me.

Edit 1:

A reproducible code sample of what I did is shown below:

    label = c(0,1,0,0,0,1,1)
    Firstday = c("2016-04-06", "2016-04-05", "2016-04-04",
        "2016-04-03", "2016-04-02", "2016-04-02","2016-04-01")
    lm.fit <- lm(label ~ Firstday)
    summary(lm.fit)$r.squared

    [1] 0.7083333

On changing to numeric:

    Firstday = as.numeric(as.POSIXct(Firstday, format="%Y-%m-%d"))

I now get

    lm.fit <- lm(label ~ Firstday)
    summary(lm.fit)$r.squared

    [1] 0.1035539
Mikee
  • Can you please include data and/or code that will provide us with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) ? – Ben Bolker Jul 19 '16 at 13:14

1 Answer


It's because your original `Firstday` is a character vector, so `lm` treats it as a categorical variable (a factor): each unique date string gets its own coefficient, and no information about the order or spacing of the dates is used.

See below how I replace the dates with arbitrary letters and get the same result: the third code snippet returns the same R-squared as the first.

    # 1. Dates as character strings: treated as categories
    label <- c(0,1,0,0,0,1,1)
    Firstday1 <- c("2016-04-06","2016-04-05","2016-04-04","2016-04-03","2016-04-02","2016-04-02","2016-04-01")
    str(Firstday1)
    lm.fit1 <- lm(label ~ Firstday1)
    summary(lm.fit1)$r.squared
    # [1] 0.7083333

    # 2. Dates converted to numeric (seconds since the epoch): a single slope
    Firstday2 <- as.numeric(as.POSIXct(Firstday1, format="%Y-%m-%d"))
    str(Firstday2)
    lm.fit2 <- lm(label ~ Firstday2)
    summary(lm.fit2)$r.squared
    # [1] 0.1035539

    # 3. Arbitrary letters with the same grouping as the dates
    Firstday3 <- c("a","b","c","d","e","e","f")
    str(Firstday3)
    lm.fit3 <- lm(label ~ Firstday3)
    summary(lm.fit3)$r.squared
    # [1] 0.7083333
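To make the categorical treatment visible, here is a minimal sketch that inspects what `lm` actually fits for the character version. It reuses the sample data from the question; the variable names `n_levels`, `X`, `n_coef`, and `df_resid` are only illustrative.

```r
label <- c(0, 1, 0, 0, 0, 1, 1)
Firstday1 <- c("2016-04-06", "2016-04-05", "2016-04-04",
               "2016-04-03", "2016-04-02", "2016-04-02", "2016-04-01")

# A character predictor is converted to a factor: one level per unique string
n_levels <- nlevels(factor(Firstday1))

# The design matrix holds an intercept plus (levels - 1) dummy columns
X <- model.matrix(label ~ Firstday1)
n_coef <- ncol(X)

# With 7 observations and 6 coefficients, only 1 residual degree of freedom is left
fit <- lm(label ~ Firstday1)
df_resid <- df.residual(fit)

print(c(levels = n_levels, coefficients = n_coef, residual_df = df_resid))
```

With almost as many coefficients as observations, a high R-squared is close to guaranteed, which is why arbitrary letters with the same grouping score just as well.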
ddunn801
  • Very true! If I understand you correctly, it is more of a pattern-recognition result than a date-related result. If that is the case, how do I convert the list of 'date' items to a numeric format that will still capture the original pattern? – Mikee Jul 19 '16 at 13:55
  • It sounds like what you may be after is time series analysis. – ddunn801 Jul 19 '16 at 14:18
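As a rough illustration of the trade-off raised in these comments, the sketch below converts the sample dates to a day count with `as.Date` (an assumption on my part; the thread itself uses `as.POSIXct`) and compares that fit with the categorical one. The names `days_since_start`, `fit_days`, and `fit_factor` are hypothetical.

```r
label <- c(0, 1, 0, 0, 0, 1, 1)
Firstday <- c("2016-04-06", "2016-04-05", "2016-04-04",
              "2016-04-03", "2016-04-02", "2016-04-02", "2016-04-01")

# Numeric encoding: whole days, shifted so the earliest date is 0
days <- as.numeric(as.Date(Firstday, format = "%Y-%m-%d"))
days_since_start <- days - min(days)

# One slope for the whole date range: a linear trend over time
fit_days <- lm(label ~ days_since_start)
r2_days <- summary(fit_days)$r.squared

# One coefficient per distinct date: no ordering, just group means
fit_factor <- lm(label ~ factor(Firstday))
r2_factor <- summary(fit_factor)$r.squared

print(c(numeric = r2_days, factor = r2_factor))
```

Rescaling a numeric date (seconds versus days, or shifting the origin) does not change R-squared, so no numeric encoding of this kind will recover the per-date fit. The two models answer different questions: a trend over time versus a separate mean for each date.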