p <- (-50:50)^2                                # parabola
y <- c(p, 2500 + 10*(1:99), p + 1000)          # parabola, rising line, parabola
plot(seq_along(y), y + 100*rnorm(length(y)))   # add noise and plot

Suppose that I have a dataset like the one above, in which only a subset of the data is linear. Plain linear regression such as lm() in R does not intelligently find the region (x = 100 to 200 in this example) in which a linear fit is appropriate.

How can I find out which part of the data is linear and perform the fit on just that subset of the data? Solutions in both R and Python are welcome.

[Scatterplot of the noisy example data: two parabolic arcs with a linear stretch between x ≈ 100 and 200]

Note, the data shown above are just an example; the method should be robust for an arbitrary dataset as long as it contains a linear portion. When there are multiple linear portions, it should find all of them. If there are no linear portions, it should report that none were found.

EDIT: Statistical methods may not be appropriate to robustly solve this problem in general, so I added the computer-vision and machine-learning tags. Maybe methods from those areas are more appropriate for solving this problem robustly?

user1424739
  • Seems more like a stats question since you're clearly asking about a method and not a specific way of coding – Dason Aug 28 '21 at 01:17
  • The linear region is between 100 and 200 in the example. Data in this part can be fit with `lm()` very well. – user1424739 Aug 28 '21 at 01:26
  • @Dason Maybe there is already some code that can be adapted to this case, as this seem to be a common question in practice. – user1424739 Aug 28 '21 at 01:27
  • 4
    This is **not** an easy question to answer in a robust/general way. – Ben Bolker Aug 28 '21 at 01:51
  • 1
    The `strucchange` package fits *linear* breakpoint models, but that's already a broad & deep area of statistical models. Generalizing to piecewise polynomial models (for example) would be quite challenging. – Ben Bolker Aug 28 '21 at 01:57
  • @BenBolker Done. I forgot to add p in the example. Please reopen it. – user1424739 Aug 28 '21 at 02:11
  • 1
    I think this will only be answerable to the OP's satisfaction **if** someone has already written a good general solver for this type of problem (which I doubt; as I said before, it's a very hard problem to solve in general). I've tried a few fits using the `segmented` package to fit piecewise-quadratic models, which quickly makes it obvious that it's going to be hard. – Ben Bolker Aug 28 '21 at 02:17
  • @MrFlick That is part of the problem. I'd think L1 or L2 norm or some kind of norm. The exact criterion may determine how the problem is best solved, and some definitions are harder. But I don't think it is the key difficulty of the problem once the norm is picked. – user1424739 Aug 28 '21 at 02:29
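
The `strucchange` idea mentioned in the comments can be sketched roughly as follows. This is my addition, not from any participant: a minimal piecewise-linear breakpoint fit on the OP's data, assuming `strucchange`'s default BIC-based model selection and a minimum segment length of h = 50.

# Minimal sketch of the strucchange approach (assumptions noted above)
library(strucchange)

set.seed(123)
p <- (-50:50)^2
y <- c(p, 2500 + 10*(1:99), p + 1000) + 100*rnorm(301)
x <- seq_along(y)

bp <- breakpoints(y ~ x, h = 50)   # h: minimum segment length
summary(bp)                        # breakpoint estimates and model-selection table
coef(bp)                           # per-segment intercept and slope

plot(x, y)
lines(x, fitted(bp), col = "red", lwd = 2)  # piecewise-linear fit

Note this finds breakpoints everywhere, so one would still need a criterion (e.g. low residual variance, or an insignificant quadratic term) to decide which segments count as "linear".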

2 Answers

5

I don't know a good built-in way to do this, and as Ben Bolker and others noted, this is not a straightforward question to answer in a robust, generalizable way. That said, I had some success on this specific example with a brute-force approach. Since I'm more comfortable with tidyverse syntax I used that, but I'm certain this could be done similarly in base R.

First, I created a grid of ranges to explore, defined by starting x and the length of the sequence. Adjust the granularity depending on how much computation you want to do. For a quick pass I used every 5th x as a start and lengths in multiples of 5. That gave me 1,830 candidate ranges of x, to which I appended the associated y's. Then I nested the x and y into a new column, data.

# From OP
p=(-50:50)^2
y=c(p, 2500+10*(1:99), p+1000)


library(tidyverse); library(broom)

df1 <- data.frame(x = seq_along(y), y = y+100*rnorm(length(y)))

df1_ranges = crossing(start  = seq.int(1, max(df1$x), by = 5), 
                      length = seq.int(5, 300, by = 5)) %>%
    mutate(end = start + length - 1) %>%
    filter(end <= max(df1$x)) %>%     # only keep ranges within the data
    uncount(length, .id = "x") %>%    # for each x, put in "length" many rows
    mutate(x = start + x - 1) %>%     # update x to run from "start" to "end"
    left_join(df1, by = "x") %>%      # attach the observed y's
    nest(data = c(x, y))

Now I can run lm regressions on each of those ranges. This takes about 9 seconds on my computer. You could speed it up by looking at fewer distinct ranges, or by being cleverer about the search space.

df1_regressions <- df1_ranges %>%
    mutate(fit = map(data, ~lm(y~x, data = .x)),   # run lm's
           glance = map(fit, glance),              # summary of fit
           tidied = map(fit, tidy))                # extract coefficients

Skipping to the chase, for this example the ranges with the best linear fit have the lowest standard error on the slope term. Sure enough, this identifies the right spot, running from about x = 100 to 200.

df1_tidied <- df1_regressions %>%
    select(start:end, tidied) %>%
    unnest(tidied) %>%
    filter(term == "x")

df1_tidied %>%
    ggplot(aes(x =  start, y = end-start, fill = 1/std.error)) +
    geom_tile() +
    geom_text(data = . %>% filter(std.error == min(std.error)) %>% 
              mutate(text = glue::glue("({start}, {end-start})")), 
          aes(label = text), color = "white", vjust = -0.5) +
    scale_fill_viridis_c(direction = -1, option = "C")

[Tile plot of candidate ranges (start vs. length), shaded by 1/std.error; the minimum-std.error range is labeled]

Whew! Now that that's out of the way, we could do what you originally asked and see the fitted regression just for that section.

df1_tidied %>% 
    slice_min(std.error) %>%
    select(start,end) %>%
    left_join(df1_ranges, by = c("start", "end")) %>%
    mutate(fit = map(data, ~lm(y~x, data = .x)),
           augment = map(fit, augment)) %>% 
    unnest(augment) -> df1_fitted

ggplot(df1, aes(x,y)) + 
    geom_point() +
    geom_line(data = df1_fitted, aes(y = .fitted), color = "red", size = 2)

[Scatterplot of the data with the fitted regression line drawn in red over the detected linear section]

Jon Spring
  • 6
    This is a far better answer than the question deserved, imho. – Limey Aug 28 '21 at 05:59
  • I thought it was an interesting thing to try. – Jon Spring Aug 28 '21 at 06:00
  • That's actually a great answer. – Martin Gal Aug 28 '21 at 09:43
  • I got the following error. How to fix it? `Error: Problem with mutate() column glance. ℹ glance = map(fit, glance). ✖ object 'glance' not found` – user1424739 Aug 28 '21 at 12:12
  • tidyverse has many disadvantages (https://stackoverflow.com/questions/61393345/are-there-any-disadvantages-to-using-tidyverse), so I am not familiar with it and don't use it. In particular, it is too verbose. The equivalent base R should be more concise? Anyone who can provide an equivalent base R solution is welcome to do so (see the sketch after these comments). – user1424739 Aug 28 '21 at 12:15
  • Yes, edited to add `library(broom)`. I forgot, it's installed with `tidyverse`, but is not a core package so needs to be loaded separately. – Jon Spring Aug 28 '21 at 17:23
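
Since a base R equivalent was requested in the comments above, here is a minimal sketch of the same brute-force idea. This is my addition, not part of the answer; the window grid and the slope-standard-error scoring mirror the tidyverse version.

# Base R sketch: fit lm() over a grid of windows and keep the window
# whose slope has the smallest standard error
set.seed(123)
p <- (-50:50)^2
y <- c(p, 2500 + 10*(1:99), p + 1000) + 100*rnorm(301)
x <- seq_along(y)

windows <- expand.grid(start = seq(1, length(x), by = 5),
                       len   = seq(5, 300, by = 5))
windows$end <- windows$start + windows$len - 1
windows <- windows[windows$end <= length(x), ]  # keep ranges inside the data

slope_se <- function(s, e) {
  fit <- lm(y[s:e] ~ x[s:e])
  summary(fit)$coefficients[2, "Std. Error"]    # std. error of the slope
}
windows$se <- mapply(slope_se, windows$start, windows$end)

best <- windows[which.min(windows$se), ]
best  # should land near start = 100, end = 200 on this example

plot(x, y)
fit <- lm(y ~ x, subset = best$start:best$end)
lines(best$start:best$end, fitted(fit), col = "red", lwd = 2)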
5

Try `dpseg` from the dpseg package. We restrict the minimum segment length to 50 to avoid short linear stretches that may occur by chance. There are other tuning parameters available; see `?dpseg` and the vignette that comes with the package for more information.

To make the input reproducible we use `set.seed`; this is done in the Note at the end.

library(dpseg)
segs <- dpseg(x = x, y = y, minl = 50); segs
## (full output shown below, just before the plot)
subset(segs$segments, var < 20000)
##    x1  x2 start end intercept    slope        r2      var
## 3 116 203   116 203  1458.242 10.15865 0.8613225 10844.35

plot(segs)

giving the following output and plot, where we see that the third segment has by far the least variance.

> segs

calculating recursion for 301 datapoints

dynamic programming-based segmentation of 301 xy data points:

   x1  x2 start end  intercept     slope        r2      var
1   1  50     1  50   2165.902 -51.13574 0.9212552 47495.24
2  50 116    50 116  -2928.772  50.00892 0.9521128 47756.06
3 116 203   116 203   1458.242  10.15865 0.8613225 10844.35
4 203 252   203 252  12533.408 -47.39630 0.9189915 42079.16
5 252 301   252 301 -12405.806  51.67657 0.9261061 45278.70

Parameters: type: var; minl: 50; maxl: 301; P: 0; jumps: 0 

[Plot of the dpseg segmentation: the five fitted segments overlaid on the data]

Note

set.seed(123)
p <- (-50:50)^2
y <- c(p, 2500+10*(1:99), p+1000)
y <- y+100*rnorm(length(y))
x <- seq_along(y)
G. Grothendieck
  • This method does not recognize that 0→50, 50→100, 200→250, and 250→300 are parts of parabolic regions. They are quite obvious to human eyes. So this is a good step toward the solution, but it is still not an ideal one (I understand the ideal solution might be difficult; that is why I added the computer-vision tag to the problem). – user1424739 Aug 28 '21 at 12:39
  • 1
    I thought it was clear that you would take segments with small variance but apparently it was not so I have explicitly added the final step of thresholding `segs` to give the linear segment(s) -- just the third segment in this case, as required. Adjust the parameters to suit. – G. Grothendieck Aug 28 '21 at 13:48
  • But testing small var is not sufficient. The point is that a line fit in a parabolic region has a systematic bias that cannot be captured by var alone. – user1424739 Aug 28 '21 at 13:55
  • 4
    It worked on the test data and I think you can use this framework to solve the general problem if you spend some time on it with your data to set the tuning parameters. Nonlinear segments are likely to have higher variance. Note that the scoref argument to dpseg can be used to provide alternate scoring and you can spend some time on trying different ones if you don't like variance. – G. Grothendieck Aug 28 '21 at 13:58