1

I have a dataframe of all_points and their coordinates:

all_points =
   point_id   latitude  longitude  
0          1  41.894577 -87.645307  
1          2  41.894647 -87.640426 
2          3  41.894713 -87.635513 
3          4  41.894768 -87.630629  
4          5  41.894830 -87.625793 

and a dataframe of the parent_points:

parent_pts = 
       parent_id
0       1             
1       2     

I want to create a column on the all_points dataframe with the closest parent point to each point.

This is my trial, but I might be making it more complicated:

from scipy.spatial.distance import cdist

def closest_point(point, points):
    """ Find closest point from a list of points. """
    return points[cdist([point], points).argmin()]

def match_value(df, col1, x, col2):
    """ Match value x from col1 row to value in col2. """
    return df[df[col1] == x][col2].values[0]

all_points['point'] = [(x, y) for x,y in zip(all_points['latitude'], all_points['longitude'])]
parent_pts['point'] = all_points['point'][all_points['point_id   '].isin(parent_pts['parent_id'])]

all_points['parent'] = [match_value(parent_pts, 'point', x, 'parent_id') for x in all_points['closest']]

The parent_point is a subset of the all_points.

I get this error when I try to use the closest_point function:

ValueError: XB must be a 2-dimensional array.
Helk
  • 121
  • 10
  • You say: _"I want to create a column on the all_points dataframe with the **closest parent point to each point**."_ How do you know the distance to the "parent"? Your `parent_pts` `DataFrame` does not include any useful information about parent points such as their coordinates. – AGN Gazer Jul 17 '17 at 00:52
  • the coordinate is the same listed in the all_points if the id is the same. – Helk Jul 17 '17 at 00:53

2 Answers2

2

First, let me start by saying that it appears to me that your longitudes and latitudes are locations on Earth. Assuming that Earth is a sphere, the distance between two points should be computed as the length along great-circle distance and not as Euclidean distance that you get using cdist.

The easiest approach from the programming point of view (except for the learning curve for you) is to use the astropy package. They have quite an OK documentation sometimes with useful examples, see, e.g., match_coordinates_sky() or catalog matching with astropy.

Then you might do something like this:

>>> from astropy.units import Quantity
>>> from astropy.coordinates import match_coordinates_sky, SkyCoord, EarthLocation
>>> from pandas import DataFrame
>>> import numpy as np
>>>
>>> # Create your data as I understood it:
>>> all_points = DataFrame({'point_id': np.arange(1,6), 'latitude': [41.894577, 41.894647, 41.894713, 41.894768, 41.894830], 'longitude': [-87.645307, -87.640426, -87.635513, -87.630629, -87.625793 ]})
>>> parent_pts = DataFrame({'parent_id': [1, 2]})
>>>
>>> # Create a frame with the coordinates of the "parent" points:
>>> parent_coord = all_points.loc[all_points['point_id'].isin(parent_pts['parent_id'])]
>>> print(parent_coord)
    latitude  longitude  point_id
0  41.894577 -87.645307         1
1  41.894647 -87.640426         2
>>>
>>> # Create coordinate array for "points" (in principle the below statements
>>> # could be combined into a single one):
>>> all_lon = Quantity(all_points['longitude'], unit='deg')
>>> all_lat = Quantity(all_points['latitude'], unit='deg')
>>> all_pts = SkyCoord(EarthLocation.from_geodetic(all_lon, all_lat).itrs, frame='itrs')
>>>
>>> # Create coordinate array for "parent points":
>>> parent_lon = Quantity(parent_coord['longitude'], unit='deg')
>>> parent_lat = Quantity(parent_coord['latitude'], unit='deg')
>>> parent_catalog = SkyCoord(EarthLocation.from_geodetic(parent_lon, parent_lat).itrs, frame='itrs')
>>> 
>>> # Get the indices (in parent_catalog) of parent coordinates
>>> # closest to each point:
>>> matched_indices = match_coordinates_sky(all_pts, parent_catalog)[0]
Downloading http://maia.usno.navy.mil/ser7/finals2000A.all
|========================================================================| 3.1M/3.1M (100.00%)         0s
>>> all_points['parent_id'] = [parent_pts['parent_id'][idx] for idx in matched_indices]
>>> print(all_points)
    latitude  longitude  point_id  parent_id
0  41.894577 -87.645307         1          1
1  41.894647 -87.640426         2          2
2  41.894713 -87.635513         3          2
3  41.894768 -87.630629         4          2
4  41.894830 -87.625793         5          2

I would like to add that match_coordinates_sky() returns not only matching indices but also a list of angular separations between the data point and the matched "parent" point as well as distance in meters between the data points and the matched "parent" point. It may be useful for your problem.

AGN Gazer
  • 8,025
  • 2
  • 27
  • 45
  • Thank you. This is very helpful. If I wanted to add a distance column, is this correct? `all_points['dist_to_parent'] = [matched_distances[idx] for idx in matched_indices]`, where `matched_distances = match_coordinates_sky(all_pts, parent_catalog)[2]` ? – Helk Jul 17 '17 at 04:02
  • 1
    I would probably try to avoid calling `match_coordinates_sky()` multiple times by doing something like this: `matched_indices, _, matched_distances = match_coordinates_sky(all_pts, parent_catalog)`. Then, `all_points['dist_to_parent'] = matched_distances` should do it. Simpler than you think :) – AGN Gazer Jul 17 '17 at 04:16
  • Lastly, would you mind explaining why the distances for points that are parent_points is not zero? And why you chose the ITRS frame? – Helk Jul 17 '17 at 04:46
  • 1
    Why did I choose ITRS? Simply because I found it in an example and a quick look at the list of built-in frames showed that `itrs` was the only "terrestrial" frame (it is possible that I missed something...) The small distances that you see are likely rounding errors. For example angular separations between positions of points with *themselves*(!!!) are: `all_pts.transform_to(parent_catalog[0]).separation(all_pts)` which produces ``. I would just neglect such small differences. – AGN Gazer Jul 17 '17 at 05:47
  • However, if you are worried about these rounding errors - open an issue on https://github.com/astropy/astropy - someone should help. – AGN Gazer Jul 17 '17 at 05:49
  • I was trying to expand on what was done here to get a list of all parent points within a specific buffer with the function `search_around_3d` from the astropy package: `matched_idx1,matched_idx2,_ = search_around_3d(all_pts, parent_catalog,3200*u.m)` However, the distlimit input gives me this error: `TypeError: 'tuple' object is not callable` Do you know what the problem could be? and is this the right function to do this? – Helk Jul 19 '17 at 15:49
  • 1
    @Helk Weird. Your code works for me, except `search_around_3d` returns a tuple of 4 values => I had to modify your call to `matched_idx1, matched_idx2, _, _= search_around_3d(...)` (notice two underscores). – AGN Gazer Jul 19 '17 at 16:21
  • `search_around_3d(all_pts, parent_catalog,320*u.m)` produces `(array([0, 1]), array([0, 1]), , )`. For `distlimit=620*u.m` I get (array([0, 0, 1, 1, 2]), array([0, 1, 0, 1, 1]), , ) – AGN Gazer Jul 19 '17 at 16:24
  • It worked now, but how can I have a dataframe with the original all_points and the list of parent points in the buffer of each point? I'm confused with this indexed output – Helk Jul 19 '17 at 19:41
  • @Helk I am sorry, I do not understand your question. Could you please expand it or reformulate it. I do not see any problem. – AGN Gazer Jul 19 '17 at 19:45
  • I'd like to generate a dataframe that informs for each point, what are the parent points nearby (in the buffer). How can I produce it from the output obtained with search_around_3d? Column 1: point; Column 2: array of parent points nearby – Helk Jul 19 '17 at 19:53
  • I think I got it! `parents_in_buff = DataFrame({point':[all_points['id'][idx] for idx in matched_idx1],'parent':[parent_pts[parent_id'][idx] for idx in matched_idx2],})` – Helk Jul 19 '17 at 20:07
  • 1
    I hope you do not expect `astropy` to do *everything* that you need. You will have to do some work on your own. If I understood correctly you want something like this: `[(i, matched_idx2[np.where(i == matched_idx1)].tolist()) for i in set(matched_idx1)]`. How to plug these data into a "dataframe" - I do not know. – AGN Gazer Jul 19 '17 at 20:10
  • 1
    I think you actually want something like this: `DataFrame({'points': close_pts_idx1, 'parent': [parent_pts['parent_id'][matched_idx2[np.where(i == matched_idx1)]].tolist() for i in close_pts_idx1]})` – AGN Gazer Jul 19 '17 at 20:27
  • you forgot to define close_pts_idx1 – Helk Jul 19 '17 at 20:46
  • 1
    Oopsy! `close_pts_idx1 = list(set(matched_idx1))` – AGN Gazer Jul 19 '17 at 20:52
1

You could do this instead if you want to use euclidean distance and use the index as your point id instead

def findClose(inX,inY,cIndex,X,Y):
    X,Y = X - inX,Y-inY
    X,Y = X**2,Y**2
    dist = np.sqrt(np.sum([X, Y], axis=0))
    dist[cIndex] = np.max(dist)*100 # ensure not the current index
    return np.argmin(dist)

X,Y = all_points['latitude'].as_matrix(),all_points['longitude'].as_matrix()
all_points['point_id'] = all_points.index
all_points['Parents'] = all_points.apply(lambda row: 
                    findClose(row['latitude'],row['longitude'],
                    row['point_id'],X,Y),axis=1)

which yields

print all_points

   point_id   latitude  longitude  Parents
0         0  41.894577 -87.645307        1
1         1  41.894647 -87.640426        0
2         2  41.894713 -87.635513        3
3         3  41.894768 -87.630629        4
4         4  41.894830 -87.625793        3
DJK
  • 8,924
  • 4
  • 24
  • 40