
I have a dataset that appears to be piecewise linear, and I would like to fit a segmented linear regression to it in R. The difficulty is that the data are discontinuous at the breakpoint. Using some pieces of code from this question, I managed to get a result, but I am not satisfied with it.

Dataset

Here is a dummy dataset.

[Image: equation of the dataset]

NB = 100
A1 = 2 # coeff for first part
A2 = 1 # coeff for second part
B1 = 0 # intercept for first part
B2 = 300 # intercept for second part
df = data.frame(n = sample(500, size=NB, replace=TRUE)) # predictor, drawn from 1..500
df$noise = sample(20, size=NB, replace=TRUE)-10 # integer noise in -9..10
# two linear pieces with a jump at n = 100
my_func <- function(n, noise) {
    if(n < 100) {
        return(A1*n+B1 + noise)
    }
    else {
        return(A2*n+B2 + noise)
    }
}
df$fn = mapply(my_func, df$n, df$noise)

Using segmented package

This is quite straightforward: we fit an ordinary linear regression and pass it to segmented.

library(segmented)
library(ggplot2)
model_segmented = segmented(lm(fn~n, data=df), seg.Z = ~ n)
predict_segmented = data.frame(n = df$n, fn = broken.line(model_segmented)$fit)
ggplot(df, aes(x = n, y = fn)) +
    geom_point() + geom_line(data = predict_segmented, color = 'blue')

Gives:

[Image: plot of the raw data and its linear regression using segmented]

Evidently, segmented expects the fitted line to be continuous at the breakpoint. That is not the case here, so the regression is not correct.

“Manual” method

This method is more tedious. First, we find the breakpoint by trying every candidate value and keeping the one that yields the lowest residual standard error. Then, we add a new factor to the linear regression that indicates whether the predictor variable is below the breakpoint or at or above it.

# Computation of the break-point
Break<-sort(unique(df$n))
Break<-Break[2:(length(Break)-1)]
d<-numeric(length(Break))
for (i in seq_along(Break)) {
    # fit a separate line on each side of the candidate break
    model_manual<-lm(fn~(n<Break[i])*n + (n>=Break[i])*n, data=df)
    d[i]<-summary(model_manual)$sigma # residual standard error
}
breakpoint = Break[which.min(d)]

# Linear regression using this break-point
df$group = df$n >= breakpoint
model_manual<-lm(fn~n*group, data=df)
dat_pred = data.frame(n = df$n, fn = predict(model_manual, df))
ggplot(df, aes(x = n, y = fn)) +
    geom_point() +
    geom_line(data=dat_pred[dat_pred$n < breakpoint,], color = 'blue') +
    geom_line(data=dat_pred[dat_pred$n >= breakpoint,], color = 'blue')

Gives:

[Image: plot of the raw data and the manual regression]

Here, the regression fits both segments well.

Question

Is there a better way to achieve this goal? Can the segmented package take discontinuous data, or is there a package that can do this?

My concern is that the second method is a bit long and not very readable.

Tom Cornebize

2 Answers


After spending a tremendous amount of time digging, I believe the chngpt package is the way to go. It can do both continuous and discontinuous segmented regressions. Link here: https://cran.r-project.org/web/packages/chngpt/vignettes/chngpt-vignette.pdf
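As a rough sketch of what this could look like on the question's dummy data: the call below assumes that chngptm() accepts a "stegmented" type together with the gaussian family, that var.type = "none" skips the variance estimation, and that the fitted object exposes the estimate as $chngpt. The seed and the vectorised data generation are my own additions, so check everything against the vignette before relying on it.

```r
library(chngpt)

# Rebuild the question's dummy data (seed is an assumption, for reproducibility only)
set.seed(42)
n <- sample(500, size = 100, replace = TRUE)
noise <- sample(20, size = 100, replace = TRUE) - 10
fn <- ifelse(n < 100, 2 * n, n + 300) + noise
df <- data.frame(n = n, fn = fn)

# formula.1 holds the part of the model without a threshold (intercept only here);
# formula.2 names the variable that carries the change point.
# "stegmented" allows both a jump and a change of slope at the threshold.
fit <- chngptm(formula.1 = fn ~ 1, formula.2 = ~ n, data = df,
               family = "gaussian", type = "stegmented",
               var.type = "none")  # assumed to skip variance estimation

fit$chngpt  # estimated change point
coef(fit)   # fitted coefficients
```

If the fit behaves as expected, the estimated change point should land near the true jump at n = 100.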

  • Thanks for pointing out the chngpt package. I think this answer should be accepted. chngpt can do this using the "stegmented" model. Note the "t". This is not a typo. The model you want is called "stegmented" as opposed to "segmented". I emphasize it because it took me a bit of time to figure it out. – stacksia Jun 22 '20 at 22:41

The strucchange package will detect the breakpoint using statistically valid methods. You can then fit each piece with whatever model you want. For example, with a seasonal time series you can apply a separate ARIMA model to each segment.
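Here is a minimal sketch of that workflow on the question's dummy data, assuming a single break and an ordinary linear model on each side. The seed, the vectorised data generation, and the minimum segment size h = 5 are my own choices, not something from the question.

```r
library(strucchange)

# Rebuild the question's dummy data (seed is an assumption, for reproducibility only)
set.seed(42)
n <- sample(500, size = 100, replace = TRUE)
noise <- sample(20, size = 100, replace = TRUE) - 10
fn <- ifelse(n < 100, 2 * n, n + 300) + noise
df <- data.frame(n = n, fn = fn)

# breakpoints() works on observation order, so sort by the predictor first
df <- df[order(df$n), ]

# Estimate one break in the regression of fn on n;
# h = 5 is an assumed minimum number of observations per segment
bp <- breakpoints(fn ~ n, data = df, h = 5, breaks = 1)

# Turn the estimated break into a two-level factor and fit one line per segment
df$segment <- breakfactor(bp, breaks = 1)
fit <- lm(fn ~ n * segment, data = df)

# bp$breakpoints is an index into the sorted data:
# the last observation that belongs to the first segment
last_left <- df$n[bp$breakpoints]
```

Because the break is returned as an observation index rather than a value of n, the data must be sorted before calling breakpoints(), and the index has to be mapped back to n as in the last line.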

Dave B.