1

I am plotting a regression line with equation for a time series data: however it does not look correct. what could be the issue?

code:

p1<-ggplot(df2,aes(x=date_his,y=his))+geom_line(lwd=0.8)+
  scale_x_date(date_labels = "%Y",expand = c(0, 0))+
  theme(plot.title = element_text(size = 14, hjust = 0.5))+
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "blue", size = 1.2)+
  ggpubr::stat_regline_equation(label.x = -Inf, label.y = Inf, vjust = 1.5, hjust = -0.1, size = 5,color = "blue",
                                formula = y ~ poly(x, 1),
                                show.legend = FALSE
  )

enter image description here

the summary of the regression:

summary(lm(his ~ date_his, data = df2))

Call:
lm(formula = his ~ date_his, data = df2)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5030 -3.7714  0.0661  4.0350  8.2860 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.682e+02  2.246e+00  74.889  < 2e-16 ***
date_his    -6.247e-04  2.138e-04  -2.922  0.00614 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.867 on 34 degrees of freedom
Multiple R-squared:  0.2007,    Adjusted R-squared:  0.1772 
F-statistic: 8.538 on 1 and 34 DF,  p-value: 0.006141
df2 <- data.frame(
  date_his = as.Date(c(
    "1979-04-28", "1980-04-28", "1981-04-28", "1982-04-28", "1983-04-28",
    "1984-04-28", "1985-04-28", "1986-04-28", "1987-04-28", "1988-04-28",
    "1989-04-28", "1990-04-28", "1991-04-28", "1992-04-28", "1993-04-28",
    "1994-04-28", "1995-04-28", "1996-04-28", "1997-04-28", "1998-04-28",
    "1999-04-28", "2000-04-28", "2001-04-28", "2002-04-28", "2003-04-28",
    "2004-04-28", "2005-04-28", "2006-04-28", "2007-04-28", "2008-04-28",
    "2009-04-28", "2010-04-28", "2011-04-28", "2012-04-28", "2013-04-28",
    "2014-04-28"
  )),
  his = c(
    166.008140637138, 169.867917428802, 157.525649715296, 172.567833065154,
    170.267019131607, 168.725866057929, 166.718135941998, 158.34217326036,
    157.493444169524, 164.212162698115, 161.140482761292, 161.851683272819,
    158.688162091076, 159.249075294438, 153.373329948267, 170.934314928049,
    164.557361076648, 169.910429608586, 163.399094199897, 161.238272986288,
    166.19105244493, 162.451935740926, 157.307524920014, 164.886371717477,
    153.843986514936, 157.656023562196, 164.433730558708, 165.485611515611,
    157.622007235028, 165.006218026182, 161.901572985301, 152.963472898683,
    154.013037508575, 156.507409368617, 157.244735400283, 161.214916194711
  )
)
CovetTachi
  • 71
  • 7
  • 2
    What do you think it should look like, and why? A first-order polynomial is a straight line. – neilfws Apr 28 '23 at 00:07
  • 1
    The y intercept is looking wrong. it is 160 even though its clearly not intersecting there – CovetTachi Apr 28 '23 at 00:22
  • OK; easier to help if you [make this question reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by including a small representative dataset in a plain text format - for example the output from `dput(df2)`, if that is not too large. – neilfws Apr 28 '23 at 00:26
  • Looks pretty reasonable to my eyes. – IRTFM Apr 28 '23 at 00:41
  • Do the `ggpubr` values agree with the output from `summary(lm(his ~ date_his, data = df2))` ? – neilfws Apr 28 '23 at 00:43
  • 1
    @neilfws i have added the summary. Also dput does not show date values. – CovetTachi Apr 28 '23 at 00:53
  • That's OK, the `dput` output includes `class = "Date"` so the numeric values will convert. – neilfws Apr 28 '23 at 01:13
  • If you find the solution yourself, please post it as an answer to your question rather than editing the question to include it ... – Ben Bolker Apr 30 '23 at 20:24
  • Again, it's probably better to post these edits as an **answer** rather than modifying your questions ... – Ben Bolker Apr 30 '23 at 22:51
  • Oh @BenBolker I was unaware. I'll do it. Thanks. – CovetTachi Apr 30 '23 at 23:38

2 Answers2

1

I think there are a couple of issues here. One is that in stat_regline_equation, the formula:

y ~ poly(x, 1)

generates the equation:

y = 160 - 14x

whereas the formula:

y ~ x

which is the default, generates the equation:

y = 170 - 0.00062x

The second formula agrees with the output from lm(), taking into account that stat_regline_equation rounds to 2 significant figures.

So poly(x, 1) is not the same as y ~ x. The reason is that poly() calculates something called orthogonal polynomials. If you want the same results as y ~ x you need to specify raw = TRUE:

coef(lm(his ~ poly(date_his, 1, raw = TRUE), data = df2))
                 (Intercept) poly(date_his, 1, raw = TRUE) 
                1.681976e+02                 -6.247098e-04 

The other issue is: what x-axis value corresponds to zero, where the intercept crosses? In this case it is "1970-01-01" which is the date origin. So you may want to think about whether extrapolating all the way back to 1970 makes sense in this model, or whether the date values should be transformed into, for example, numbers representing days since some other start date.

neilfws
  • 32,751
  • 5
  • 50
  • 63
  • Hi. I tried your suggestion and replaced the dates on the x axis with numbers. I plotted the same with y ~ x and the results didn't match. (Intercept) 166.29912 is teh reult of summary however while plotting it becomes 170. the same with y ~ poly(x, 1) becomes 160. – CovetTachi Apr 30 '23 at 19:55
  • How did you convert the dates to numbers? – neilfws Apr 30 '23 at 22:31
  • I actually modified the code a little bit. ill add it to the post. I used the year numbers in a separate column to use as x labels and created another column with number of variables ( in my case 36 years) which I used to calculate the regression equation. – CovetTachi Apr 30 '23 at 22:36
  • I edited my answer to provide more details about `poly()`. The basic issue is that by default `poly()` does not calculate the same thing as `y ~ x`. – neilfws Apr 30 '23 at 22:42
  • Thankyou for that, you gave me the idea to not use dates to calculate regression which solved the wrong value issue – CovetTachi Apr 30 '23 at 22:50
0

in my case data is in monthly values in format "X%Y.%m"

df2$date_his<-as.Date(sub("X", "", df2$date_his), format = "%Y")
    Date <- as.POSIXct(df2$date_his, format = "%Y-%m-%d")
    df2$date_his<-format(Date, format="%Y")
    df2$number<-c(1:36)
    p1<-ggplot(df2,aes(x=date_his,y=his))+geom_line(lwd=0.8)+
      geom_smooth(method = "lm", se = FALSE, color = "blue", linetype = "dashed") +
      scale_x_continuous(date_labels = "%Y", expand = c(0, 0))+
      theme(plot.title = element_text(size = 18, hjust = 0.5),
            axis.text = element_text(size = 14),
            axis.title = element_text(size = 14),
            axis.ticks = element_line(size = 1))+
      ylim(150,190)
    
      fit1 <- lm(his ~ number, data = df2)
    eqn1 <- sprintf("y = %.2 f + %.2f x", coef(fit1)[1], coef(fit1)[2])
p1<-p1+geom_text(aes(x = -Inf, y = Inf,label = eqn1),vjust = 1.5, hjust = -0.1, size = 7, color = "blue")
CovetTachi
  • 71
  • 7