How to join dataset on coordinates?

Question

I am analysing Datasets and I need to compare them. The two Datasets got an Index and Coordinates(X,Y) each. The coordinates are not equal so I need to use something like the numpy.isclose (e.g. atol=5) function.

My aim in the comparison is to find similar y coordinates (e.g. y[5]= 101 (Dataset1), y2[7] = 103 (Dataset2)). And I need to compare the x-coordinates of the same indices (e.g. x[5]= 405 (Dataset1), x2[7] = 401 (Dataset2))

My problem is that I cant combine these two isclose functions

I have tried to compare at first the y and afterwards the x coordinates. If it is a separate comparison the function will find other Data as well. (e.g. y[5] = 101, y2[7] = 103; x[5] = 405, x[3] = 402). It needs to compare same indices (5/5 and 7/7).

This is working but gives wrong results:

yres = {i for i in yvar if numpy.isclose(yvar2, i, atol= 5).any()}
xres = {i for i in xvar if numpy.isclose(xvar2, i, atol= 5).any()}

Theoretically i am searching for something like this:

yres = {i for i in yvar if numpy.isclose(yvar2, i, atol= 5).any() & i for i in xvar if numpy.isclose(xvar2, i, atol= 5).any()}

Expect finding points with similar coordinates (e.g. y[5]=101, y2[7] = 103 ; x[5] = 405 , x2[7] = 401).

At the moment I receive any similar data (e.g. y[5]=101, y2[7] = 103 ; x[5] = 405 , x2[3] = 402).

Bellow Input example (Picture1 and Picture2):

Pict1 Pict2

In this picture I need to identify 4 point pairs (Index pict1 / Index pict2):

6 / 9
7 / 8
17 / 13
20 / 14

Can you post a sample of input data. Could you also provide some details. Do you need the closest points to join, or all within a tolerance (within a ball)? Is a metric like distance `sqrt((x1-x2)**2 + (y1-y2)**2)` between points relevant as join criterion? — jlandercy, Sep 11 '19 at 11:48
The distance is in pixel. I need the points within a choosable tolerance. — prKn, Sep 11 '19 at 13:27
Have you looked at [`geopandas`](http://geopandas.org/)? I'm not totally sure I follow your problem, but I think it's what you're looking for. It's like `pandas` but you can use a **geometry** (point, line or polygon) as the index, so things like spatial joins are very easy to do. — Matt Hall, Sep 11 '19 at 14:51

jlandercy · Accepted Answer · 2019-09-12T14:30:10.177

Forewords

Your question is related to Nearest Neighbors Search (NNS). One way to solve it is to build a spatial index like in Spatial Databases.

A straightforward solution is KD-Tree which is implemented in sklearn.

Questions

At this point it is essential to know what question we want to answer:

Q1.a) Find all points in dataset B which are as close as (distance) points of A within a given threshold atol (radius).

Or:

Q2.a) Find the k closest point in a dataset B with respect to each point of my dataset A.

Both questions can be answered using KD-Tree, what we must realise is:

Questions Q1 and Q2 are different, so are their answers;
Q1 can map 0 or more points together, there is no guaranty about one-to-one mapping;
Q2 will map exactly 1 to k points, there is a guaranty that all points in reference dataset are mapped to k points in search dataset (provided there is enough points);
Q2.a is generally not equivalent to its reciprocal question Q2.b (when datasets A and B are permuted).

MCVE

Lets build a MCVE to address both questions:

# Parameters
N = 50
atol = 50
keys = ['x', 'y']

# Trials Datasets (with different sizes, we keep it general):
df1 = pd.DataFrame(np.random.randint(0, 500, size=(N-5, 2)), columns=keys).reset_index()
df2 = pd.DataFrame(np.random.randint(0, 500, size=(N+5, 2)), columns=keys).reset_index()

# Spatial Index for Datasets:
kdt1 = KDTree(df1[keys].values, leaf_size=5, metric='euclidean')
kdt2 = KDTree(df2[keys].values, leaf_size=5, metric='euclidean')

# Answer Q2.a and Q2.b (searching for a single neighbour):
df1['kNN'] = kdt2.query(df1[keys].values, k=1, return_distance=False)[:,0]
df2['kNN'] = kdt1.query(df2[keys].values, k=1, return_distance=False)[:,0]

# Answer Q1.a and Q1.b (searching within a radius):
df1['radius'] = kdt2.query_radius(df1[keys].values, atol)
df2['radius'] = kdt1.query_radius(df2[keys].values, atol)

Bellow the result for Dataset A as reference:

   index    x    y  kNN    radius
0      0   65  234   39      [39]
1      1  498   49   11      [11]
2      2   56  171   19  [29, 19]
3      3  239   43   20      [20]
4      4  347   32   50      [50]
[...]

At this point, we have everything required to spatially join our data.

Nearest Neighbors `(k=1)`

We can join our datasets using kNN index:

kNN1 = df1.merge(df2[['index'] + keys], left_on='kNN', right_on='index', suffixes=('_a', '_b'))

It returns:

   index_a  x_a  y_a  kNN    radius  index_b  x_b  y_b
0        0   65  234   39      [39]       39   49  260
1        1  498   49   11      [11]       11  487    4
2        2   56  171   19  [29, 19]       19   39  186
3        3  239   43   20      [20]       20  195   33
4        4  347   32   50      [50]       50  382   32
[...]

Graphically it leads to:

And reciprocal question is about:

We see that mapping is exactly 1-to-k=1 all points in reference dataset are mapped to another point in search dataset. But answers differ when we swap reference.

Radius `atol`

We can also join our datasets using the radius index:

rad1 = df1.explode('radius')\
           .merge(df2[['index'] + keys], left_on='radius', right_on='index',
                  suffixes=('_a', '_b'))

It returns:

   index_a  x_a  y_a  kNN radius  index_b  x_b  y_b
0        0   65  234   39     39       39   49  260
2        1  498   49   11     11       11  487    4
3        2   56  171   19     29       29   86  167
4        2   56  171   19     19       19   39  186
7        3  239   43   20     20       20  195   33
[...]

Graphically:

Reciprocal answer is equivalent:

We see answers are identical, but there is no guaranty for a one-to-one mapping. Some points are not mapped (lonely points), some are mapped to many points (dense neighbourhood). Additionally, it requires an extra parameters atol which must be tuned for a given context.

Bonus

Bellow the function to render figures:

def plot(A, B, join, title=''):
    X = join.loc[:,['x_a','x_b']].values
    Y = join.loc[:,['y_a','y_b']].values
    fig, axe = plt.subplots()
    axe.plot(A['x'], A['y'], 'x', label='Dataset A')
    axe.plot(B['x'], B['y'], 'x', label='Dataset B')
    for k in range(X.shape[0]):
        axe.plot(X[k,:], Y[k,:], linewidth=0.75, color='black')
    axe.set_title(title)
    axe.set_xlabel(r'$x$')
    axe.set_ylabel(r'$y$')
    axe.grid()
    axe.legend(bbox_to_anchor=(1,1), loc='upper left')
    return axe

References

Some useful references:

Thank you for your detailed reply! You are right this seems to be the solution of my task. I still got the problem of two Data Sets to compare. I got 2 Pictures (2 sets of coordinates) to compare with prefixed indices (not allowed to change). As i can see this method works with one set of coordinates and gives indices automatically. — prKn, Sep 11 '19 at 13:02
Thanks for this update. I still see the problem that this solution compare every point with each other. I just want to compare Dataset1 with Dataset2. Even with creating a cloumn for source index it would compare Datas from the same set. — prKn, Sep 11 '19 at 13:20
@Carlo, How many points do you intend to join? Also see this [post](https://gis.stackexchange.com/questions/222315/geopandas-find-nearest-point-in-other-dataframe) for another method using GIS index. — jlandercy, Sep 11 '19 at 13:24
It is kind of random. The datasets are out of microscope pictures. Theoretically the points should cover each other or there should be a distance in between. Fact is that these two pictures are made with different filters so that the intensity and the size of the points differs. Because of the different size the centercoordinates differs a little. So i am looking for points with almost same positions in two pictures. — prKn, Sep 11 '19 at 13:41
@Carlo, I have addressed your requirement to have data split among two dataset. Now it is up to you to choose which method suits you best. Cheers — jlandercy, Sep 12 '19 at 08:33
@Carlo, Out of curiosity, which method have you chosen to solve you problem? — jlandercy, Sep 12 '19 at 14:02