I have some 2D data where a large number of the rows obey exactly one of a few linear relationships. It's easy to identify the lines when the data is plotted:
How can I identify the slopes and intercepts of these lines?
Which linear relationship applies to a given row is determined by another variable, but that variable has been lost. I don't care that I won't be able to predict new values; I just want all the slopes and intercepts.
If the intercepts are zero, the algorithm is relatively easy. Simply compute r = y/x for every point, round it to some precision, then identify the most frequent r. However, this won't generalize when the intercepts are nonzero.
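To make the zero-intercept case concrete, here is a minimal sketch in base R. The toy data (two lines through the origin with slopes 0.5 and 2, plus a few noisy points) and the 3-decimal rounding precision are assumptions chosen for illustration; they are not taken from my real data.

```r
# Hypothetical toy data: two lines through the origin (slopes 0.5 and 2)
set.seed(1)
x <- runif(1000, 1, 100)
slope <- ifelse(seq_along(x) %% 2 == 0, 0.5, 2)
y <- slope * x

# Knock the first 50 points off their lines
y[1:50] <- y[1:50] + runif(50, -20, 20)

# Ratio for every point, rounded to an assumed precision of 3 decimals
r <- round(y / x, 3)

# Tabulate the ratios; the most frequent values are the candidate slopes
counts <- sort(table(r), decreasing = TRUE)
top_slopes <- sort(as.numeric(names(counts)[1:2]))
print(top_slopes)
```

With nonzero intercepts the ratio y/x is no longer constant along a line, which is why this shortcut breaks down.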
Reproducible data:
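library(data.table)
# TRUE where i is divisible by d
div <- function(i, d) {
  (i %% d) == 0L
}
DT <- data.table(x = runif(1e6, 1, 100e3), i = seq_len(1e6))
# Default line
DT[, y := 0.8 * x + 23333]
# Rows with i divisible by 3 follow a second line
DT[div(i, 3), y := 0.3 * x + 14444]
# Rows with i divisible by 7 follow a third line (overriding the lines above)
DT[div(i, 7), y := 1.7 * x + 8888]
# Move the first 50,000 rows off their lines with uniform noise
DT[1:50e3, y := y + runif(.N, -20e3, 20e3)]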
One approach I've tried is a cross-join, calculating the slopes between a sample of points and all other points. In this case it does identify the slopes; however, it only works when a small minority of points lie off the lines, and it may be a bit inefficient.
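library(magrittr)  # provides %>%
# Tabulate rounded slopes between the points in seq. and a random sample of siz points
CJ1 <- function(seq., siz = 500) {
  CJ(i1 = seq.,
     i2 = sample.int(1e6, size = siz)) %>%
    # attach (x, y) of point i1
    .[DT, on = "i1==i", nomatch = 0L] %>%
    # attach (x1, y1) of point i2
    .[DT[, .(x1 = x, y1 = y, i2 = i)], on = "i2", nomatch = 0L] %>%
    # slope of the segment joining the two points
    .[, m := round((y - y1) / (x - x1), 3)] %>%
    .[, .N, keyby = .(m)] %>%
    .[order(-N)] %>%
    # keep slopes more than twice as frequent as the 20th most common one
    .[N > (2 * N[20])]
}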
Are there any established modelling techniques to extract such linear relationships?