
I'm having trouble fitting an average curved line through my data in order to find its length. I have a lot of X, Y points in a large pandas DataFrame that look something like:

x = np.asarray([731501.13, 731430.24, 731360.29, 731289.36, 731909.72, 731827.89,
   731742.  , 731657.74, 731577.95, 731502.64, 731430.39, 731359.12,
   731287.3 , 731214.21, 732015.59, 731966.88, 731902.67, 731826.31,
   731743.79, 731660.94, 731581.29, 731505.4 , 731431.95, 732048.71,
   732026.66, 731995.46, 731952.18, 731894.29, 731823.58, 731745.16,
   732149.61, 732091.53, 732052.98, 732026.82, 732005.17, 731977.63,
   732691.84, 732596.62, 732499.45, 732401.62, 732306.18, 732218.35,
   732141.82, 732080.91, 732038.21, 732009.08, 733023.08, 732951.99,
   732873.32, 732787.51])

y = np.asarray([7873771.69, 7873705.34, 7873638.03, 7873571.73, 7874082.33,
   7874027.2 , 7873976.22, 7873923.58, 7873866.35, 7873804.53,
   7873739.58, 7873673.62, 7873608.23, 7873544.15, 7874286.21,
   7874197.15, 7874123.96, 7874063.21, 7874008.78, 7873954.69,
   7873897.31, 7873836.09, 7873772.38, 7874564.62, 7874448.23,
   7874341.23, 7874246.59, 7874166.93, 7874100.4 , 7874041.77,
   7874912.56, 7874833.09, 7874733.62, 7874621.43, 7874504.65,
   7874393.89, 7875225.26, 7875183.85, 7875144.42, 7875105.69,
   7875064.49, 7875015.5 , 7874954.94, 7874878.36, 7874783.13,
   7874674.  , 7875476.18, 7875410.05, 7875351.67, 7875300.61])

The x and y are map-view coordinates and I want to calculate the length. I can code the Euclidean distance, but because the points are scattered and aren't ordered one after another, I'm having trouble fitting a single line through them. I've tried polyfit, but it mainly produces a straight line even with a higher degree, e.g.:

import numpy as np
import matplotlib.pyplot as plt

z = np.polyfit(x, y, 10)   # high-degree polynomial fit
p = np.poly1d(z)
plt.scatter(x, y, marker='x')
plt.scatter(x, p(x), marker='.')

plt.show()
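
For reference, this is roughly what I mean by coding the Euclidean distance; summing consecutive segment distances only gives the right length if the points are already ordered along the line, which mine aren't (just a sketch):

dists = np.hypot(np.diff(x), np.diff(y))  # distances between consecutive points
length = dists.sum()                      # total length, assuming the points are ordered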

This is to demonstrate what I mean: [linked plot]

Any help would be greatly appreciated!

Sophie
  • what do you mean by *moving line*? – Quang Hoang May 23 '19 at 15:24
  • I've added a link to the main post to show what I mean. I'd like these map coordinates to basically become one continuous line instead of being scattered points. However any type of best fit line only approximately goes through it, which is not what I want as I want to measure the length of the data. – Sophie May 23 '19 at 15:35
  • Your fit quality is bad because there are points near X=732100 with widely varying Y coordinates. If you see the Y data as a function of X, it means the fitting function must create a big slope. `np.polyfit` cannot do that without screwing up fit quality elsewhere, but the fundamental issue is that fitting map coordinates is a different game from fitting a function (for instance, if you rotate the whole dataset 45°, your fitting procedure should produce the same tendency line rotated 45°). (See https://stackoverflow.com/questions/22691682/rotation-invariant-curve-fitting, similar problem) – Leporello May 23 '19 at 15:37
  • Is your objective just to measure the length of the sequence (x_1, y_1) to (x_n, y_n)? Would it make sense to just sum the distances between contiguous segments? – xibalba1 May 23 '19 at 15:41
  • @Leporello Makes sense, np.polyfit was the only way of getting something that resembled what I wanted, but I get that I don't want to fit a function to something like this. I was wondering if it's possible to break up the data into segments and connect the lines up that way if that makes sense? – Sophie May 23 '19 at 15:55
  • @xibalba1 Yes I want to do this but I'm not sure how to when the X, Y coordinates are scattered like in the data shown – Sophie May 23 '19 at 15:58
  • Do you have any uncertainties on the X and Y values? How well are they known? @Sophie – jtlz2 May 23 '19 at 19:53
  • @jtlz2 It's not known at all - for the purposes of what I want to do it is just to get an automated line that goes through the points because I have 200 more data sets like this. – Sophie May 24 '19 at 12:00
  • @Sophie no problem. And do you have any physical motivation for the shapes of the lines? What actually are they? – jtlz2 May 24 '19 at 12:28
  • @jtlz2 They're actually geological faults in the subsurface! It's basically a 3D plane but because of the way it's been mapped (taking consideration of depth as well) this birds eye view of the coordinates will show this type of scatter. I am interested in measuring the length and the width of this scattered data. This is considered to be one fault, however I would love to automate the process because I have hundreds of faults in my data. – Sophie May 24 '19 at 13:19
  • OK cool. Do you have a function for mapping that plane to the surface projection? – jtlz2 May 25 '19 at 09:10
  • @jtlz2 Nope, X and Y are the raw data, although I do have a Z array too which puts it in the third dimension, but it is not consistent data. 'Slicing' this on one horizontal Z plane didn't work for me as it varied too much between faults/data sets. – Sophie May 25 '19 at 09:51
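
Following up on the suggestion above to sum the distances between contiguous segments, here is a minimal sketch of one way to do that: order the points along their dominant direction first (so the ordering does not depend on x alone), then sum consecutive Euclidean distances. This assumes the trace is roughly monotonic along that direction, which may not hold for every fault:

import numpy as np

pts = np.column_stack([x, y])
centered = pts - pts.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)  # principal axes of the point cloud
order = np.argsort(centered @ vt[0])                     # sort along the dominant direction
ordered = pts[order]
length = np.sum(np.hypot(*np.diff(ordered, axis=0).T))   # sum of consecutive segment lengths
print(length)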

2 Answers


This would be an empirical function fitting your data:

import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit


x = np.asarray([731501.13, 731430.24, 731360.29, 731289.36, 731909.72, 731827.89,
   731742.  , 731657.74, 731577.95, 731502.64, 731430.39, 731359.12,
   731287.3 , 731214.21, 732015.59, 731966.88, 731902.67, 731826.31,
   731743.79, 731660.94, 731581.29, 731505.4 , 731431.95, 732048.71,
   732026.66, 731995.46, 731952.18, 731894.29, 731823.58, 731745.16,
   732149.61, 732091.53, 732052.98, 732026.82, 732005.17, 731977.63,
   732691.84, 732596.62, 732499.45, 732401.62, 732306.18, 732218.35,
   732141.82, 732080.91, 732038.21, 732009.08, 733023.08, 732951.99,
   732873.32, 732787.51])/732 -1000

y = np.asarray([7873771.69, 7873705.34, 7873638.03, 7873571.73, 7874082.33,
   7874027.2 , 7873976.22, 7873923.58, 7873866.35, 7873804.53,
   7873739.58, 7873673.62, 7873608.23, 7873544.15, 7874286.21,
   7874197.15, 7874123.96, 7874063.21, 7874008.78, 7873954.69,
   7873897.31, 7873836.09, 7873772.38, 7874564.62, 7874448.23,
   7874341.23, 7874246.59, 7874166.93, 7874100.4 , 7874041.77,
   7874912.56, 7874833.09, 7874733.62, 7874621.43, 7874504.65,
   7874393.89, 7875225.26, 7875183.85, 7875144.42, 7875105.69,
   7875064.49, 7875015.5 , 7874954.94, 7874878.36, 7874783.13,
   7874674.  , 7875476.18, 7875410.05, 7875351.67, 7875300.61])/7873 - 1000

def my_func( x, x0, y0, a, b, c, t, s):
    xs = x-x0
    p = a * xs**3 + b * xs**2 + c * xs + y0
    t = t * np.tanh( s * xs )
    return p + t

xth = np.linspace( -1.15, 1.5, 50 )
yth = my_func( xth, 0.03, 0.18, .01, 0, 0.05, .05 , 10)

sol, err = curve_fit( my_func, x, y, p0=[0.03, 0.18, .01, 0, 0.05, .05 , 10] ) 
print( sol )
fig = plt.figure()
ax = fig.add_subplot( 1, 1, 1 )
ax.scatter( x, y )
ax.plot( xth, yth )
ax.plot( xth, my_func( xth, *sol) )
plt.show()

giving

>>[ 2.86281016e-02  1.95292660e-01  9.62290944e-03 -1.26304655e-02 5.11281073e-02  4.63955967e-02  1.02260568e+01]

and

out of the blue....comes orange
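
If the final goal is the length of the curve, a possible follow-up (a sketch, not part of the fit itself) is to sample the fitted function densely, undo the scaling and shifting applied to the data above, and sum the segment lengths:

xs = np.linspace( x.min(), x.max(), 2000 )
ys = my_func( xs, *sol )
x_map = ( xs + 1000 ) * 732   # undo x / 732 - 1000
y_map = ( ys + 1000 ) * 7873  # undo y / 7873 - 1000
length = np.sum( np.hypot( np.diff( x_map ), np.diff( y_map ) ) )
print( length )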

mikuszefski
  • Note that I scaled and shifted your data a bit to get a better feeling for the values. – mikuszefski May 24 '19 at 06:34
  • Same question as to James - how have you chosen the function and the order of the polynomial etc.? – jtlz2 May 24 '19 at 09:25
  • @jtlz2 There is, unfortunately, no general answer to this. In my experience many *sharper* transitions can be fitted by `tanh`. Here, the presence of an `x0` and `y0` is obvious, as is a linear term. As the data bends away at the edges, order three is also required. My first try didn't have the quadratic term, but it turned out to significantly improve the fit (inspection by eye). In total I think that is the smallest set of assumptions capturing all the characteristics of the data. – mikuszefski May 24 '19 at 10:47
  • You can quantify it - Occam's Razor using the Bayesian evidence, or a Delta Chisq if you can't be bothered.. :) – jtlz2 May 24 '19 at 11:16
  • @jtlz2 What do you need quantified? As a next step you could probably show, via the reduced chi-square, that a fourth-order term is not justified. – mikuszefski May 24 '19 at 11:31
  • @mikuszefski I defer to your obviously superior solution that perfectly answers the question that was asked, and properly hang my head in programmatic and statistical shame before the excellence of your reply here. – James Phillips May 24 '19 at 12:57
  • @JamesPhillips I just wanted to show you the beauty of `tanh` again. (And considering your *hammering on this for a few hours*, I'll not tell you how many minutes I needed. Cheers.) – mikuszefski May 24 '19 at 14:03
  • @mikuszefski thanks for not rubbing "time salt" into the professional wound that I gave to myself. – James Phillips May 24 '19 at 15:11
  • Massive thanks to both for the help, I really appreciate it! @James Phillips – Sophie May 25 '19 at 09:13

Here is what I came up with after hammering on this for a few hours. I began by observing that there are approximately two data regions, the lower half and the upper half of the data ranges, with different characteristics in each half. The upper half is flatter with fewer data points, and the lower half has more curvature with small groups of nearly overlapping data points. Below is my attempt to separately model these two regions as a first cut at the problem. I have included a "zoomed" plot showing the disjointed overlap region which makes this code unsatisfactory in its present form. I feel confident that I could beat on this for another day or two and get it into better shape, but this solution might not be what you need.

plot 1

plot 2

import numpy
import matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

cutoffVal = 732200.0 # x below or above this value

xData = numpy.asarray([731501.13, 731430.24, 731360.29, 731289.36, 731909.72, 731827.89,
   731742, 731657.74, 731577.95, 731502.64, 731430.39, 731359.12,
   731287.3, 731214.21, 732015.59, 731966.88, 731902.67, 731826.31,
   731743.79, 731660.94, 731581.29, 731505.4, 731431.95, 732048.71,
   732026.66, 731995.46, 731952.18, 731894.29, 731823.58, 731745.16,
   732149.61, 732091.53, 732052.98, 732026.82, 732005.17, 731977.63,
   732691.84, 732596.62, 732499.45, 732401.62, 732306.18, 732218.35,
   732141.82, 732080.91, 732038.21, 732009.08, 733023.08, 732951.99,
   732873.32, 732787.51])


yData = numpy.asarray([7873771.69, 7873705.34, 7873638.03, 7873571.73, 7874082.33,
   7874027.2, 7873976.22, 7873923.58, 7873866.35, 7873804.53,
   7873739.58, 7873673.62, 7873608.23, 7873544.15, 7874286.21,
   7874197.15, 7874123.96, 7874063.21, 7874008.78, 7873954.69,
   7873897.31, 7873836.09, 7873772.38, 7874564.62, 7874448.23,
   7874341.23, 7874246.59, 7874166.93, 7874100.4, 7874041.77,
   7874912.56, 7874833.09, 7874733.62, 7874621.43, 7874504.65,
   7874393.89, 7875225.26, 7875183.85, 7875144.42, 7875105.69,
   7875064.49, 7875015.5, 7874954.94, 7874878.36, 7874783.13,
   7874674. , 7875476.18, 7875410.05, 7875351.67, 7875300.61])


# split off data into above and below cutoff
xAboveList = []
yAboveList = []
xBelowList = []
yBelowList = []
for i in range(len(xData)):
    if xData[i] > cutoffVal:
        xAboveList.append(xData[i])
        yAboveList.append(yData[i])
    else:
        xBelowList.append(xData[i])
        yBelowList.append(yData[i])

xAbove = numpy.array(xAboveList)        
xBelow = numpy.array(xBelowList)        
yAbove = numpy.array(yAboveList)        
yBelow = numpy.array(yBelowList)        

# to fit for data above the cutoff value use a quadratic logarithmic equation
def aboveFunc(x, a, b, c):
    return a + b*numpy.log(x) + c*numpy.power(numpy.log(x), 2.0)

# to fit for data below the cutoff value use a hyperbolic type with offset
def belowFunc(x, a, b, c):
    val = x - cutoffVal
    return val / (a + (b * val) - ((a + b) * val * val)) + c

# some initial parameter values
initialParameters_above = numpy.array([1.0, 1.0, 1.0])
initialParameters_below = numpy.array([-4.29E-04, 4.31E-04,  7.87E+06])

# curve fit the equations individually to their respective data
aboveParameters, pcov = curve_fit(aboveFunc, xAbove, yAbove, initialParameters_above)
belowParameters, pcov = curve_fit(belowFunc, xBelow, yBelow, initialParameters_below)

# for plotting the fitting results
xModelAbove = numpy.linspace(max(xBelow), max(xAbove))
xModelBelow = numpy.linspace(min(xBelow), max(xBelow))
y_fitAbove = aboveFunc(xModelAbove, *aboveParameters)
y_fitBelow = belowFunc(xModelBelow, *belowParameters)

plt.plot(xData, yData, 'D') # plot the raw data as a scatterplot
plt.plot(xModelAbove, y_fitAbove) # plot the above equation using the fitted parameters
plt.plot(xModelBelow, y_fitBelow) # plot the below equation using the fitted parameters
plt.show()

print('Above parameters:', aboveParameters)
print('Below parameters:', belowParameters)
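
If the end goal is the overall length, a rough estimate (a sketch only, and still affected by the disjointed overlap region noted above) is to sample each fitted piece densely and sum the segment lengths:

xDenseBelow = numpy.linspace(min(xBelow), max(xBelow), 1000)
xDenseAbove = numpy.linspace(max(xBelow), max(xAbove), 1000)
yDenseBelow = belowFunc(xDenseBelow, *belowParameters)
yDenseAbove = aboveFunc(xDenseAbove, *aboveParameters)
lengthBelow = numpy.sum(numpy.hypot(numpy.diff(xDenseBelow), numpy.diff(yDenseBelow)))
lengthAbove = numpy.sum(numpy.hypot(numpy.diff(xDenseAbove), numpy.diff(yDenseAbove)))
print('Approximate total length:', lengthBelow + lengthAbove)
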
James Phillips
  • @jtlz2 with RMSE, R-squared and visual inspection of the plotted fits. One equation search each for both the upper and lower data ranges. – James Phillips May 23 '19 at 19:55
  • Sounds good. Would be interested to hear a bit more about those figures of merit for the different functions you tried. chi-by-eye is surely to be avoided!? :-S – jtlz2 May 24 '19 at 09:24
  • @jtlz2 RMSE is analogous to "average magnitude of error". R-squared reports how much of the variance in the dependent variable is explained by the model, and is both approximate and useful for non-linear modeling. Visual inspection of the plotted function ensures a smooth curve that lacks what is named "Runge's phenomenon" for high-order polynomials. I did not use chi-by-eye, though I understand the term. – James Phillips May 24 '19 at 12:36
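
For reference, those figures of merit can be computed directly from the fit residuals; a minimal sketch for the 'below' piece of the answer above (the 'above' piece works the same way):

predicted = belowFunc(xBelow, *belowParameters)
residuals = yBelow - predicted
rmse = numpy.sqrt(numpy.mean(residuals**2))                 # average magnitude of error
r_squared = 1.0 - numpy.var(residuals) / numpy.var(yBelow)  # fraction of variance explained
print('RMSE:', rmse, 'R-squared:', r_squared)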