
I have several dataframes which each contain two columns of x and y values, so each row represents a point on a curve. The different dataframes then represent contours on a map. I have another series of data points (fewer in number), and I'd like to see which contour they are closest to on average.

I would like to establish the distance from each datapoint to each point on the curve, with sqrt(x^2+y^2) - sqrt(x_1^2 + y_1^2), and add these up over every point on the curve. The trouble is that there are several thousand points on the curve and only a few dozen datapoints to assess, so I can't simply put these in columns next to each other.

I think I need to cycle through the datapoints, checking the distance between each of them and every point on the curve. I don't know whether there is an easy function or module that can do this. Thanks in advance!

Edit: Thanks for the comments. @Alexander: I've tried the vectorize function, as follows, with a sample dataset. I'm actually using contours which comprise several thousand datapoints, and the dataset to compare against is 100+ points, so I'd like to automate as much as possible. I'm currently able to create a distance measurement from the first datapoint against my contour, but I would ideally like to cycle through j as well. When I try that, it comes up with an error:

import numpy as np
from numpy import vectorize
import pandas as pd
from pandas import DataFrame

df1 = {'X1':['1', '2', '2', '3'], 'Y1':['2', '5', '7', '9']}
df1 = DataFrame(df1, columns=['X1', 'Y1'])
df2 = {'X2':['3', '5', '6'], 'Y2':['10', '15', '16']}
df2 = DataFrame(df2, columns=['X2', 'Y2'])
df1=df1.astype(float)
df2=df2.astype(float)
Distance=pd.DataFrame()

i = range(0, len(df1))
j = range(0, len(df2))

def myfunc(x1, y1, x2, y2):
    return np.sqrt((x2-x1)**2+np.sqrt(y2-y1)**2)

vfunc=np.vectorize(myfunc)
Distance['Distance of Datapoint j to Contour']=vfunc(df1.iloc[i]['X1'], df1.iloc[i]['Y1'], df2.iloc[0]['X2'], df2.iloc[0]['Y2'])
Distance['Distance of Datapoint j to Contour']=vfunc(df1.iloc[i]['X1'], df1.iloc[i]['Y1'], df2.iloc[1]['X2'], df2.iloc[1]['Y2'])
Distance
Tom
  • Subtraction is not defined for strings. – Stop harming Monica May 25 '19 at 13:40
  • @Tom, I've added one example of a solution to my answer that uses your `df1`, `df2` – imposeren May 25 '19 at 14:32
  • @Tom, another approach: use `intersecting.area` minus `symmetric_difference.area` as a similarity metric. It might be a more "straightforward" approach because the collections can have different numbers of points, you can ignore "density irregularities" (when some small part of the line has a huge number of points and the rest of a big area is covered by just a few), and you can often skip all "normalization" steps entirely. See my answer for details. – imposeren May 25 '19 at 15:59

2 Answers


For the distance, you need to change your formula to

from math import sqrt

def getDistance(x, y, x_i, y_i):
    return sqrt((x_i - x)**2 + (y_i - y)**2)

with (x,y) being your datapoint and (x_i, y_i) being a point from the curve.

Consider using NumPy for vectorization. Explicitly looping through your data points will most likely be less efficient; depending on your use case, it might still be quick enough. (If you need to run this on a regular basis, I think vectorization will easily outpace the explicit loops.) The explicit approach could look something like this:

import numpy as np # Universal abbreviation for the module

datapoints = np.random.rand(3,2) # Returns an array with random entries of size 3x2 (imagine it as 3 sets of x- and y-values)

contour1 = np.random.rand(1000, 2) # Other than the size (which is 1000x2) no different than datapoints
contour2 = np.random.rand(1000, 2)
contour3 = np.random.rand(1000, 2)

def squareDistanceUnvectorized(datapoint, contour):
    retVal = 0.
    print("Using datapoint with values x:{}, y:{}".format(datapoint[0], datapoint[1]))

    lengthOfContour = np.size(contour, 0) # This gets you the number of lines in the vector

    for pointID in range(lengthOfContour):
        squaredXDiff = np.square(contour[pointID,0] - datapoint[0])
        squaredYDiff = np.square(contour[pointID,1] - datapoint[1])
        retVal += np.sqrt(squaredXDiff + squaredYDiff)

    retVal = retVal / lengthOfContour # As we want the average, we are dividing the sum by the element count
    return retVal

if __name__ == "__main__":
    noOfDatapoints = np.size(datapoints,0)
    contID = 0
    for currentDPID in range(noOfDatapoints):
        dist1 = squareDistanceUnvectorized(datapoints[currentDPID,:], contour1)
        dist2 = squareDistanceUnvectorized(datapoints[currentDPID,:], contour2)
        dist3 = squareDistanceUnvectorized(datapoints[currentDPID,:], contour3)
        if dist1 < dist2 and dist1 < dist3:
            contID = 1
        elif dist2 < dist1 and dist2 < dist3:
            contID = 2
        elif dist3 < dist1 and dist3 < dist2:
            contID = 3
        else:
            contID = 0
        if contID == 0:
            print("Datapoint {} is equally close to two contours".format(currentDPID))
        else:
            print("Datapoint {} is closest to contour {}".format(currentDPID, contID))

Okay, now moving on to vector-land.

I have taken the liberty of adjusting this part to what I think your dataset looks like. Try it and let me know if it works.

import numpy as np
import pandas as pd

# Generate 1000 points (2-dim Vector) with random values between 0 and 1. Make them strings afterwards.
# This is the first contour
random2Ddata1 = np.random.rand(1000,2)
listOfX1      = [str(x) for x in random2Ddata1[:,0]]
listOfY1      = [str(y) for y in random2Ddata1[:,1]]

# Do the same for a second contour, except that we shift this one 255 units along the first dimension
random2Ddata2 = np.random.rand(1000,2)+[255,0]
listOfX2      = [str(x) for x in random2Ddata2[:,0]]
listOfY2      = [str(y) for y in random2Ddata2[:,1]]

# After this step, our 'contours' are basically two blobs of datapoints whose centers are approx. 255 units apart.

# Generate a set of 4 datapoints and make them a Pandas-DataFrame
datapoints = {'X': ['0.5', '0', '255.5', '0'], 'Y': ['0.5', '0', '0.5', '-254.5']}
datapoints = pd.DataFrame(datapoints, columns=['X', 'Y'])

# Do the same for the two contours
contour1    = {'Xf': listOfX1, 'Yf': listOfY1}
contour1    = pd.DataFrame(contour1,  columns=['Xf', 'Yf'])

contour2    = {'Xf': listOfX2, 'Yf': listOfY2}
contour2    = pd.DataFrame(contour2,  columns=['Xf', 'Yf'])

# We do now have 4 datapoints.
# - The first datapoint is basically where we expect the mean of the first contour to be.
#   Contour 1 consists of 1000 points with x, y- values between 0 and 1
# - The second datapoint is at the origin. Its distances should be similar to the ones of the first datapoint
# - The third datapoint would be the result of shifting the first datapoint 255 units into the positive first dimension
# - The fourth datapoint would be the result of shifting the first datapoint 255 units into the negative second dimension

# Transformation into numpy array
# First the x and y values of the data points
dpArray = ((datapoints.values).T).astype(float)
c1Array = ((contour1.values).T).astype(float)
c2Array = ((contour2.values).T).astype(float)

# This did the following:
# - Transform the datapoints and contours into numpy arrays
# - Transpose them afterwards so that if we want all x values, we can write var[0,:] instead of var[:,0].
#   A personal preference, maybe
# - Convert all the values into floats.

# Now, we iterate through the contours. If you have a lot of them, putting them into a list beforehand would do the job
for contourid, contour in enumerate([c1Array, c2Array]):
    # Now for the datapoints
    for _index, _value in enumerate(dpArray[0,:]):
        # The next two lines do vectorization magic.
        # First, we square the difference between one dpArray entry and the contour x values.
        # You might notice that contour[0,:] returns a 1x1000 vector while dpArray[0,_index] is a single float value.
        # This works because dpArray[0,_index] is broadcasted to fit the size of contour[0,:].
        dx       = np.square(dpArray[0,_index] - contour[0,:])
        # The same happens for dpArray[1,_index] and contour[1,:]
        dy       = np.square(dpArray[1,_index] - contour[1,:])
        # Now, we take (for one datapoint and one contour) the mean value and print it.
        # You could write it into an array or do basically anything with it that you can imagine
        distance = np.mean(np.sqrt(dx+dy))
        print("Mean distance between contour {} and datapoint {}: {}".format(contourid+1, _index+1, distance))

# But you want to be able to call this... so here we go, generating a function out of it!
def getDistanceFromDatapointsToListOfContoursFindBetterName(datapoints, listOfContourDataFrames):
    """ Takes a DataFrame with points and a list of different contours to return the average distance for each combination"""
    dpArray = ((datapoints.values).T).astype(float)

    listOfContours = []
    for item in listOfContourDataFrames:
        listOfContours.append(((item.values).T).astype(np.float))

    retVal  = np.zeros((np.size(dpArray,1), len(listOfContours)))
    for contourid, contour in enumerate(listOfContours):
        for _index, _value in enumerate(dpArray[0,:]):
            dx       = np.square(dpArray[0,_index] - contour[0,:])
            dy       = np.square(dpArray[1,_index] - contour[1,:])
            distance = np.mean(np.sqrt(dx+dy))
            print("Mean distance between contour {} and datapoint {}: {}".format(contourid+1, _index+1, distance))
            retVal[_index, contourid] = distance

    return retVal

# And just to see that it is, indeed, returning the same results, run it once
getDistanceFromDatapointsToListOfContoursFindBetterName(datapoints, [contour1, contour2])

Alexander Mayer
  • This is awesome! I have to go through and try to understand it fully, but I've just tried to write in a test 3x2 dataframe instead of your random generation within the datapoints variable, and it doesn't seem to like it "(0, slice(None, None, None))' is an invalid key", but maybe I have to change the format of what I'm writing in. I'll also have to adapt this, because I'm trying to find out the closest contour not just for each datapoint, but for the set of datapoints (a curve), but I think that should just be a question of taking an average... – Tom May 25 '19 at 11:08
  • @Tom As I cannot comment under your question directly (yet), can you please provide an example dataset for both contours and data points to check? – Alexander Mayer May 25 '19 at 13:43
  • Thanks a million - I've uploaded what I'd tried so far. I've only tried briefly with your code. It works very well for the random data, but I'll need to call in my own datapoints and cycle through 100 or so too. I'll then need to decide for each set (of 100 datapoints) which contour is closest on average. – Tom May 25 '19 at 13:52
  • P.S. - The code works perfectly...but I've noticed that in some cases it gives an unexpected result. Without going into the details, it favours the "flatter" contour lines as more datapoints are closer, whereas some form of correlation distance would be better. I think maybe considering only the three smallest distances for each contour would be an idea. I've posted a simpler question on that elsewhere! Thanks again for your help. – Tom May 26 '19 at 09:40

General idea

The "curve" is actually a polygon with a lot's of points. There definetly some libraries to calculate the distance between the polygon and the point. But generally it will be something like:

  1. Calculate an "approximate distance" to the whole polygon, e.g. to the bounding box of the polygon (from the point to its 4 line segments), or to the center of the bounding box
  2. Calculate the distances to the lines of the polygon. If you have too many points, then as an extra step the "resolution" of the polygon might be reduced.
  3. The smallest distance found is the distance from the point to the polygon.
  4. Repeat for each point and each polygon
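
For illustration, here is a rough sketch of that idea using shapely (not part of the original answer; the contour and datapoint coordinates below are made up, and the contour points are assumed to be ordered so that they form a valid polygon):

from shapely.geometry import Point, Polygon

# Hypothetical, ordered contour and a handful of datapoints
contour_points = [(0, 0), (4, 0), (4, 3), (0, 3)]
datapoints     = [(1, 1), (5, 5), (-2, 0)]

contour = Polygon(contour_points)

for i, (x, y) in enumerate(datapoints):
    # Point.distance(Polygon) is 0 for points inside the polygon;
    # measure against contour.exterior to get the distance to the outline itself.
    d = Point(x, y).distance(contour.exterior)
    print("Distance of datapoint {} to contour outline: {}".format(i, d))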

Existing solutions

Some libraries already can do that:

Example:


# get distance from points of 1 dataset to all the points of another dataset

from scipy.spatial import distance

d = distance.cdist(df1.to_numpy(), df2.to_numpy(), 'euclidean')

print(d)
# Results will be a matrix of all possible distances:
# [[ D(Point_df1_0, Point_df2_0), D(Point_df1_0, Point_df2_1), D(Point_df1_0, Point_df2_2)]
#  [ D(Point_df1_1, Point_df2_0), D(Point_df1_1, Point_df2_1), D(Point_df1_1, Point_df2_2)]
#  [ D(Point_df1_2, Point_df2_0), D(Point_df1_2, Point_df2_1), D(Point_df1_2, Point_df2_2)]
#  [ D(Point_df1_3, Point_df2_0), D(Point_df1_3, Point_df2_1), D(Point_df1_3, Point_df2_2)]]

[[ 8.24621125 13.60147051 14.86606875]
 [ 5.09901951 10.44030651 11.70469991]
 [ 3.16227766  8.54400375  9.8488578 ]
 [ 1.          6.32455532  7.61577311]]

What to do next is up to you. For example, as a metric of the "general distance between curves" you can:

  • Pick the smallest values in each row and each column (if you skip some columns/rows, you might end up with a candidate that "matches only a part of the contour"), and calculate their median: np.median(np.hstack([np.amin(d, axis) for axis in range(len(d.shape))])).
  • Or you can calculate the median of:

    • all the distances: np.median(d)
    • of "smallest 2/3 of distances": np.median(d[d<np.percentile(d, 66, interpolation='higher')])
    • of "smallest distances that cover at least each rows and each columns":
for min_value in np.sort(d, None):
    chosen_indices = d<=min_value
    if np.all(np.hstack([np.amax(chosen_indices, axis) for axis in range(len(chosen_indices.shape))])):
        break

similarity = np.median(d[chosen_indices])

Alternative approach

An alternative approach would be to use some "geometry" library to compare the areas of concave hulls:

  • Build concave hulls for the contours and for the "candidate datapoints" (not easy, but possible: using shapely, using concaveman). But if you are sure that your contours are already ordered and without overlapping segments, then you can build polygons directly from those points without the need for a concave hull.

  • Use "intersection area" minus "non-common area" as a metric of similarity (shapely can be used for that):

This approach might be better than processing distances in some situations, for example:

  • You want to prefer "fewer points covering whole area" over "huge amount of very close points that cover only half of the area"
  • It's a more obvious way to compare candidates with different numbers of points

But it has its disadvantages too (just draw some examples on paper and experiment to find them).

Other ideas:

  • instead of using polygons or concave hull you can:

    • build a linear ring from your points and then use contour.buffer(some_distance). This way you ignore the "internal area" of the contour and only compare the contour itself (with a tolerance of some_distance). The distance between the centroids (or double that) may be used as the value for some_distance; see the sketch after this list.
    • You can build polygons/lines from segments using ops.polygonize
  • instead of using intersection.area - symmetric_difference.area you can:

    • Snap one object to another, and then compare the snapped object to the original
  • Before comparing real objects you can compare "simpler" versions of the objects to filter out obvious mismatches:
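
As a rough sketch of the linear-ring/buffer idea from the list above (my own illustration, not the answer's code; the outlines and the fallback tolerance are made up):

from shapely.geometry import LinearRing

ring_a = LinearRing([(0, 0), (4, 0), (4, 3), (0, 3)])          # contour outline
ring_b = LinearRing([(0.2, 0), (4.2, 0), (4.2, 3), (0.2, 3)])  # candidate outline

# Use the centroid distance as the buffer tolerance, with a small floor so that
# identical outlines do not produce a zero-width buffer.
some_distance = max(ring_a.centroid.distance(ring_b.centroid), 0.1)

# Buffering the rings compares only the outlines (within some_distance),
# ignoring the interior area of the contours.
band_a = ring_a.buffer(some_distance)
band_b = ring_b.buffer(some_distance)
overlap_ratio = band_a.intersection(band_b).area / band_a.union(band_b).area
print(overlap_ratio)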

imposeren
  • This is fantastic - I can get the matrix of distances for all of the contours in question, and then decide which of the options is suitable - probably correlation distance by the looks of it. I won't try the geometric approach as I'm pushed for time, although I understand how that could also give good results. Many thanks for your help! – Tom May 25 '19 at 16:43
  • @Tom, Main issues with the dist-approach are caused by density irregularities: * In curves with variable density, points in denser parts should have less impact on the final metric * The number of points in 2 similar polygons might differ greatly → normalize by nr(points): use averages, not sums. Some "distances" might already do that: I suggest searching for info on every option from `scipy.spatial.distance` and investigating (...Mahalanobis distance looks good). But if points are evenly distributed, then a plain average is a fine similarity metric: `similarity = np.mean(distance.cdist(np_a1, np_a2, 'euclidean'))` – imposeren May 25 '19 at 17:15
  • @Tom, I've added several code samples of "calculate median of smallest distances...". I suggest using several metrics at once (median over all distances, and median over the smallest subset of distances) and, on the first iteration of your algorithm, returning candidates based on all of the metric options. Then visually check which metric is better. Or at least make it easy to switch between the different metric approaches. – imposeren May 25 '19 at 18:12