
I currently have a large data set for which I need to calculate a segmented regression (or fit a piecewise linear function in some similar way). However, I have both a large data set and a very large number of pieces.

Currently I have the following approach:

  • Let s_i be the end of segment i
  • Let (x_i, y_i) denote the i-th data point

Assume the data point x_k lies within segment j; then I can create a vector from x_k as

(s_1, s_2 − s_1, s_3 − s_2, ..., x_k − s_{j−1}, 0, 0, ...)

To do a segmented regression on the data, I can then run an ordinary linear regression on these vectors.
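To make the construction concrete, here is a small sketch of this design-matrix formulation in Python/NumPy. The intercept column and the toy breakpoints/slopes are my own additions for illustration, not part of the question:

```python
import numpy as np

def feature_vector(x, s):
    """Map a point x to its segmented-regression features.

    s: sorted segment end points; segment j covers (s[j-1], s[j]].
    The vector is (s1, s2-s1, ..., x-s[j-1], 0, 0, ...).
    """
    v = np.zeros(len(s) + 1)     # one slope coefficient per segment
    j = np.searchsorted(s, x)    # index of the segment containing x
    prev = 0.0
    for i in range(j):
        v[i] = s[i] - prev       # full width of each earlier segment
        prev = s[i]
    v[j] = x - prev              # partial width inside segment j
    return v

# toy data: breakpoints at x = 1 and 2; true slopes 1, -1, 2; intercept 0.5
s = np.array([1.0, 2.0])
xs = np.linspace(0.05, 3.0, 60)
ys = 0.5 + np.minimum(xs, 1) - np.clip(xs - 1, 0, 1) + 2 * np.maximum(xs - 2, 0)

# design matrix: intercept column plus one feature vector per point
A = np.column_stack([np.ones(len(xs)),
                     np.array([feature_vector(x, s) for x in xs])])
coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
# coef = [intercept, slope_1, slope_2, slope_3]; continuity at the
# breakpoints is automatic because neighbouring segments share the
# accumulated widths in the feature vector.
```

Continuity comes for free here, which is exactly why this formulation is attractive despite its size.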

However, my current estimates show that if I define the problem this way, I will get about 600,000 vectors with about 2,000 components each. I haven't benchmarked yet, but I don't think my computer will be able to solve such a large regression problem in any acceptable time.

Is there a better way to calculate this kind of regression problem? One idea was to use some kind of hierarchical approach, i.e. solve one regression problem that combines multiple segments, so that I can determine start and end points for that set, and then calculate an individual segmented regression for the segments within it. However, I cannot figure out how to calculate the regression for such a set of segments so that the endpoints match (by fixing the intercept I can match either the start point or the end point, but not both).

Another idea I had was to calculate an individual regression for each segment and then only use the slope of that segment. With that approach, however, errors might start to accumulate, and I have no way to control for this kind of error accumulation.

Yet another idea is to do an individual regression for each segment but fix the intercept to the endpoint of the previous segment. However, I am still not sure whether I would get some kind of error accumulation this way.
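For what it's worth, this chained variant can be written down in a few lines, since a line through a fixed point has only its slope free and that one-dimensional least-squares problem has a closed form. This is just a sketch; the `edges` array and the starting height `y_start` are assumptions:

```python
import numpy as np

def chained_slopes(xs, ys, edges, y_start):
    """Fit one slope per segment, forcing each segment's line through the
    previous segment's fitted endpoint.

    edges: boundaries e0 < e1 < ... < eK; segment i covers (e[i], e[i+1]].
    A line through (e[i], y0) has one free parameter, the slope m, whose
    least-squares solution is m = sum(dx*dy) / sum(dx*dx).
    """
    xs, ys = np.asarray(xs), np.asarray(ys)
    y0 = y_start
    slopes, knots = [], [y_start]
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        mask = (xs > lo) & (xs <= hi)
        dx, dy = xs[mask] - lo, ys[mask] - y0
        m = (dx @ dy) / (dx @ dx) if mask.any() else 0.0
        y0 = y0 + m * (hi - lo)    # carry the endpoint to the next segment
        slopes.append(m)
        knots.append(y0)
    return slopes, knots
```

On noiseless piecewise-linear data this recovers the true slopes exactly, but a biased fit in one segment does shift every later knot, which is precisely the accumulation worry raised above.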

Clarification

Not sure if this was clear from the rest of the question: I know where the segments start and end. The most important requirement is that each line segment intersects the next one at the segment boundary.

EDIT

Maybe another fact that could help: all points have different x values.

LiKao
  • how many points per segment, on average and at minimum? – Spektre Mar 24 '15 at 12:23
  • There should be around 300 points per segment. Thinking some more about the problem, there may actually be some segments without any points. Before, I was planning to set the segments so that I would always have some points in them, but I just noticed that I might join some parts which could have different slopes. – LiKao Mar 24 '15 at 12:30

1 Answer


I would group points into rectangular grid areas

based on their position. That way you process the task on several smaller data sets and then merge the results together when all are done.

I would process each group like this:

  1. compute histogram of angles

  2. take only the most frequently occurring angles

    their count determines the number of line segments present in the group

  3. do the regression/line fit for these angles

    See this answer; it does something very similar (just for a single line)

  4. compute the intersection points

    between neighbouring line segments to get the endpoints of your piecewise polyline, and also the connectivity info (join the closest endpoints)
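Steps 1–2 could be sketched roughly as follows; the bin count and peak threshold are arbitrary choices of mine, not values from the answer:

```python
import numpy as np

def dominant_angles(pts, bins=90, keep=0.5):
    """Histogram the direction of consecutive point pairs (after sorting
    by x) and return the most frequent angles in degrees; the number of
    peaks estimates the number of line segments in the group."""
    pts = pts[np.argsort(pts[:, 0])]
    d = np.diff(pts, axis=0)
    ang = np.degrees(np.arctan2(d[:, 1], d[:, 0])) % 180.0
    hist, edges = np.histogram(ang, bins=bins, range=(0.0, 180.0))
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[hist >= keep * hist.max()]   # peaks of the histogram
```

For real, noisy data the consecutive-pair directions would need smoothing or pairing over a wider window; this only shows the histogram idea.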

[edit1] after OP edit

You know the edge x coordinates of all segments (x0, x1, ...), so just compute the average y coordinate of the points near each segment edge (gray area, green points) and you get the segment line endpoints (blue points). Of course this is no fit or regression, because all the other points are discarded, so it leads to bigger errors (unless the segment x coordinates correspond to the regressed lines ...), but there is no way around it with the constraints of the solution you have (at least I do not see any).

Because if you use regression on the segment data, then you cannot connect it to the other segments, and if you try to merge them, then you get almost the same result as this:

example

the size of the gray area determines the output ... so play with it a bit ...
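The boundary-averaging step could look like this (a minimal sketch; `width` is the half-width of the gray area and has to be tuned, as the answer says):

```python
import numpy as np

def boundary_endpoints(xs, ys, edges, width):
    """Average the y of all points within +-width of each segment edge
    (the 'gray area'); the averages become the polyline vertices."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    verts = []
    for e in edges:
        mask = np.abs(xs - e) <= width
        verts.append((e, ys[mask].mean()))   # assumes each band is non-empty
    return verts
```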

Spektre
  • I am not sure that I understand correctly. As far as I understand, this method tries to find a polyline which fits a set of points best, but none of the coordinates of the vertices are known. Is this understanding correct? However, for my data I need the vertices to be at fixed x coordinates. Hence the segmented regression approach, which ensures that the segments meet at the x coordinates of the vertices. If I calculate the slopes of each segment individually, they may meet at different x coordinates. – LiKao Mar 24 '15 at 12:54
  • @LiKao this finds a line/axis corresponding to the point cloud; you then need to connect all the found lines together based on endpoint distances... btw, is your data a function or not (i.e., can there be more than one `y` value for a single `x`)? An example of the input data would be helpful (2D plot) – Spektre Mar 24 '15 at 13:10
  • @LiKao after reading your edit I updated my answer (hope I get it right) – Spektre Mar 24 '15 at 18:38
  • I was thinking about a similar solution before as well... However, I have not followed up on it yet, because the error for the points near the segment boundary gets amplified a lot with this method. I guess I will have to try different methods on a small dataset and see how they compare. – LiKao Mar 25 '15 at 10:18
  • Also for your comment "if you use regression on segment data then you can not connect it to other segments": this actually is not that big of a problem. The regression method I explained in my question easily connects the segments; however, its complexity is too high to use it. I can also easily calculate segments iteratively and then connect each to the end point of the previous or next segment (by fixing the intercept), but not to both the previous and next endpoints. Also, if I do this, I worry about stability. Errors may accumulate, and I cannot prove that they won't. – LiKao Mar 25 '15 at 10:22
  • @LiKao how many segments are there? With known x of the endpoints, the regression is just `O(n*log(a)*log(y))`, where `n` is points per segment (~300), `a` is the number of possible angles (slopes ... can be limited to, for example, 3600) and `y` the number of possible offsets ((ymax-ymin)/accuracy), so that is not that expensive. Also, you can use the first point's y as the last one from the previous segment, and the last `log` is dismissed ... you can use the approximation from here http://stackoverflow.com/q/29166819/2521214 – Spektre Mar 25 '15 at 10:41
  • Not sure how you derived this (especially the log(a) part). As far as I understand regression, it relies on computing the pseudoinverse of a matrix. In my case the matrix would be number of segments × number of points, so 2,000 × 600,000. Computing the pseudoinverse then uses Gaussian elimination, which runs in O(N^3). You seem to have some other regression method in mind, though. Segmenting and using the previous endpoint may lead to instability (at least I cannot prove that it doesn't). – LiKao Mar 25 '15 at 10:48
  • @LiKao no ... you have to treat each segment as a single regression/fit using only the points inside its interval (or maybe plus half of the previous and next segments), so the point count is ~300 or ~600, and now you want to find the line that is closest to all of those points ... – Spektre Mar 25 '15 at 10:57
  • @LiKao so try `all` combinations of `y(x)=y0+(x-x0)*(y1-y0)/(x1-x0)`, so just y0, y1 are the unknowns ... and `(y1-y0)/(x1-x0)=tan(angle)`. But instead of all combinations you can approximate with a divide-and-conquer-like approach (with my approx class or CCD or whatever), so instead of `m` tries per variable you get `log(m)` ... and inside each try you compute the average absolute distance of all points from that line and store the minimum solution ... I know it is line fitting instead of regression, but the result should be close if not the same, in much, much less time – Spektre Mar 25 '15 at 10:58
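The parametrisation in that comment, a segment with known end x coordinates and only the endpoint heights y0, y1 unknown, also has a direct least-squares solution; this is a sketch using squared error rather than the comment's average absolute distance:

```python
import numpy as np

def fit_endpoints(xs, ys, x0, x1):
    """Least-squares fit of y(x) = y0 + (x-x0)*(y1-y0)/(x1-x0), treating
    only the endpoint heights y0, y1 as unknowns."""
    t = (np.asarray(xs) - x0) / (x1 - x0)   # relative position in [0, 1]
    A = np.column_stack([1 - t, t])         # y(x) = (1-t)*y0 + t*y1
    (y0, y1), *_ = np.linalg.lstsq(A, np.asarray(ys), rcond=None)
    return y0, y1
```

With squared error no search over angles or offsets is needed at all; the search only becomes necessary if the absolute-distance objective is a hard requirement.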
  • @LiKao the thing you are describing looks like a high-order polynomial regression/fit (not a segmented line ...) – Spektre Mar 25 '15 at 11:50