2

I have a handful of data points that cluster along a line in 3d space. I have the x,y,z data in a csv file that I want to import. I would like to find an equation that represents that line, or the plane perpendicular to that line, or whatever is mathematically correct. These data are independent of each other. Maybe there are better ways to do this than what I tried to do but...

I attempted to replicate an old post here that seemed to be doing exactly what I'm trying to do Fitting a line in 3D

but it seems that maybe updates over the past decade have left the second part of the code not working? Or maybe I'm just doing something wrong. I've included the entire thing that I frankensteined together from this at the bottom. There are two lines that seem to be giving me a problem.

I've snippeted them out here...

import numpy as np

pts = np.add.accumulate(np.random.random((10,3)))
x,y,z = pts.T

# this will find the slope and x-intercept of a plane
# parallel to the y-axis that best fits the data
A_xz = np.vstack((x, np.ones(len(x)))).T
m_xz, c_xz = np.linalg.lstsq(A_xz, z)[0]

# again for a plane parallel to the x-axis
A_yz = np.vstack((y, np.ones(len(y)))).T
m_yz, c_yz = np.linalg.lstsq(A_yz, z)[0]

# the intersection of those two planes and
# the function for the line would be:
# z = m_yz * y + c_yz
# z = m_xz * x + c_xz
# or:
def lin(z):
    x = (z - c_xz)/m_xz
    y = (z - c_yz)/m_yz
    return x,y

#verifying:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

fig = plt.figure()
ax = Axes3D(fig)
zz = np.linspace(0,5)
xx,yy = lin(zz)
ax.scatter(x, y, z)
ax.plot(xx,yy,zz)
plt.savefig('test.png')
plt.show()

They return this, but no values...

FutureWarning: rcond parameter will change to the default of machine precision times max(M, N) where M and N are the input matrix dimensions. To use the future default and silence this warning we advise to pass rcond=None, to keep using the old, explicitly pass rcond=-1. m_xz, c_xz = np.linalg.lstsq(A_xz, z)[0] FutureWarning: rcond parameter will change to the default of machine precision times max(M, N) where M and N are the input matrix dimensions. To use the future default and silence this warning we advise to pass rcond=None, to keep using the old, explicitly pass rcond=-1. m_yz, c_yz = np.linalg.lstsq(A_yz, z)[0]

I don't know where to go from here. I don't even actually need the plot, I just needed an equation and am ill-equipped to move forward. If anyone knows an easier way to do this, or can point me in the right direction, I'm willing to learn, but I'm very, very lost. Thank you in advance!!

Here is my entire frankensteined code in case that is what is causing the issue.

import pandas as pd
import numpy as np
mydataset = pd.read_csv('line1.csv')

x = mydataset.iloc[:,0]
y = mydataset.iloc[:,1]
z = mydataset.iloc[:,2]


data = np.concatenate((x[:, np.newaxis], 
                       y[:, np.newaxis], 
                       z[:, np.newaxis]), 
                      axis=1)


# Calculate the mean of the points, i.e. the 'center' of the cloud
datamean = data.mean(axis=0)

# Do an SVD on the mean-centered data.
uu, dd, vv = np.linalg.svd(data - datamean)

# Now vv[0] contains the first principal component, i.e. the direction
# vector of the 'best fit' line in the least squares sense.

# Now generate some points along this best fit line, for plotting.

# we want it to have mean 0 (like the points we did
# the svd on). Also, it's a straight line, so we only need 2 points.
linepts = vv[0] * np.mgrid[-100:100:2j][:, np.newaxis]

# shift by the mean to get the line in the right place
linepts += datamean

# Verify that everything looks right.

import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d as m3d

ax = m3d.Axes3D(plt.figure())
ax.scatter3D(*data.T)
ax.plot3D(*linepts.T)
plt.show()

# this will find the slope and x-intercept of a plane
# parallel to the y-axis that best fits the data
A_xz = np.vstack((x, np.ones(len(x)))).T
m_xz, c_xz = np.linalg.lstsq(A_xz, z)[0]

# again for a plane parallel to the x-axis
A_yz = np.vstack((y, np.ones(len(y)))).T
m_yz, c_yz = np.linalg.lstsq(A_yz, z)[0]

# the intersection of those two planes and
# the function for the line would be:
# z = m_yz * y + c_yz
# z = m_xz * x + c_xz
# or:
def lin(z):
    x = (z - c_xz)/m_xz
    y = (z - c_yz)/m_yz
    return x,y

print(x,y)

#verifying:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

fig = plt.figure()
ax = Axes3D(fig)
zz = np.linspace(0,5)
xx,yy = lin(zz)
ax.scatter(x, y, z)
ax.plot(xx,yy,zz)
plt.savefig('test.png')
plt.show()
  • Can you give us a snippet of your dataset? Try adding the first 20 lines to your question using `mydataset.head(20)` – cfort Jan 25 '21 at 17:00
  • Did you just want to see the data? I have the csv file here. It's only like 17 points right now, but I have a few hundred points on different lines that will come and go which is why I wanted it to be excel compatible. Thank you for your patience lol. https://drive.google.com/file/d/1I0SkBnpzFTHS7jMWSotegVsjl-pxH2bO/view?usp=sharing – Shelby Johnston Jan 25 '21 at 17:16
  • @ShelbyJohnston, you should add your data as a list or numpy array to the code snippet so that we can run your code without downloading any data. – EMiller Jan 25 '21 at 17:39
  • I had left it as importing the csv because I thought that might be the problem. Now I see that its not. I copied the original code, since I know it isn't wrong (and I don't know what I'll screw up by doing it myself) and pasted it in the snippet. They just used random values and it gives the exact same error... – Shelby Johnston Jan 25 '21 at 18:01

2 Answers2

4

As was proposed in the old post you refer to, you could also make use of principal component analysis instead of a least squares approach. For that I suggest sklearn.decomposition.PCA from the sklearn package.

An example can be found below using the csv-file you provided.

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

mydataset = pd.read_csv('line1.csv')

x = mydataset.iloc[:,0]
y = mydataset.iloc[:,1]
z = mydataset.iloc[:,2]

coords = np.array((x, y, z)).T

pca = PCA(n_components=1)
pca.fit(coords)
direction_vector = pca.components_
print(direction_vector)


# Create plot
origin = np.mean(coords, axis=0)
euclidian_distance = np.linalg.norm(coords - origin, axis=1)
extent = np.max(euclidian_distance)

line = np.vstack((origin - direction_vector * extent,
                  origin + direction_vector * extent))

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(coords[:, 0], coords[:, 1], coords[:,2])
ax.plot(line[:, 0], line[:, 1], line[:, 2], 'r')

enter image description here

Henri
  • 348
  • 2
  • 8
1

You can get rid of the complaint from leastsquares by adding rcond=None like this:

m_xz, c_xz = np.linalg.lstsq(A_xz, z, rcond=None)[0]

Is this the right decision for your situation? I have no idea. But there's more about it in the docs.

When I run your code with your inputs it seems to run just fine and I get values assigned to m_xz, c_xz, etc. If you don't call them explicitly with print('m_xz') (or whatever) then you won't see them.

m_xz
Out[42]: 5.186132604596112

c_xz
Out[43]: 62.5764694106141

Also, you reference your data in kind of two different ways. You get x, y, and z from your csv, but also put it into a numpy array. You can get rid of the duplication and pandas by just using numpy:

data = np.genfromtxt('line1.csv', delimiter=',', skip_header=1)

x = data[:,0]
y = data[:,1]
z = data[:,2] 
cfort
  • 2,655
  • 1
  • 19
  • 29