Compare geohashes between two pandas dataframes

Question

I have 2 dataframes df1 and df2 with different Latitutde and Longitude along with their corresponding geohashes. Now for each geohash in df1, I want to find the closest geohash in dataframe df2. I am not sure whether is there a way to compare geohashes. For example, for the id 121 in df1, the closest geohash in df2 would be 9muc3rr and for id 122 in df2, the closest geohash would be 9wv97m1.

Dataframe df1

Id    Latitude   Longitude  Geohash 
121   32.815130 -117.151695  9mudwju
122   37.920948 -108.005043  9wepwr3

Dataframe df2

Id   Latitude    Longitude  Geohash

124  32.604187  -117.005745  9muc3rr
127  37.920948  -108.005043  9wv97m1
135  39.70122   -104.876976  9xj3v7q
128  38.844032  -104.718307  9wvscp6

To my knowledge, I dont see any method in geopandas for comparing geohashes. — user3447653, Aug 18 '16 at 19:25
Isn't the closest geohash the lat/lon pairs with the shortest distance? — Jon Clements, Aug 18 '16 at 19:27
Yes, the closest geohash comes back to closest lat/long pairs — user3447653, Aug 18 '16 at 19:29
If two geohashes do not share a prefix, is their closeness proportional to the closeness of the numerical value of the first character? — Mad Physicist, Aug 18 '16 at 19:47

score 0 · Answer 1 · edited May 23 '17 at 12:00

If you are OK with reinventing the wheel a little, you can convert the (lat, lon) pairs to cartesian unit vectors, and then just compare those using a dot product. Since dot product is basically a measure of the projection of one vector onto another, the product closest to 1 (the maximum) will be the best match between two vectors.

The sample calculation below is based on this answer. I will make the assumption that you are providing geodetic coordinates on the WGS84 ellipsoid (since that what GPS uses), and that the altitude above the ellipsoid is zero for all points:

from math import radians, sin, cos
import numpy as np

# WGS 84 parameters. Any other ellipsoid can be plugged in by changing
# the following lines. All parameters are taken from Wikipedia at
# https://en.wikipedia.org/wiki/Geodetic_datum#Parameters_for_some_geodetic_systems
invFlat = 298.257222101  # Inverse flattening (1/f), unitless
# Derived parameters
e2 = 6694.37999014  # First eccentricity squared. Unitless. Can be computed from 2*f − f**2

# Note that the radius is irrelevant since we are going to
# normalize the result anyway.

def cartesianUnitVector(lat, lon, isdeg=True):
    if isdeg:
        lat, lon = radians(lat), radians(lon)
    vec = np.array([
        cos(lat) * cos(lon),
        cos(lat) * sin(lon),
        (1 - e2) * sin(lat)
    ])
    norm = np.linalg.norm(vec)
    return vec / norm

target = (32.815130, -117.151695)
candidates = [
    (32.604187,  -117.005745),
    (37.920948,  -108.005043),
    (39.70122,   -104.876976),
    (38.844032,  -104.718307)
]

max(candidates, key=lambda x: np.dot(cartesianUnitVector(*x), cartesianUnitVector(*target)))

The geodetic-to-ECEF formula can be found on Wikipedia. The example shows how to operate on an iterable of lat-lon pairs. I am not exactly sure how to adapt this to pandas, but your question is about how to do the comparison, and I think I have provided an answer for that. I am sure that once you have defined the conversion function and the comparison key that uses it, you will have no trouble applying this to pandas.

Compare geohashes between two pandas dataframes

1 Answers1