Calculation of variance of Geo coordinates

Question

How to calculate the `variance` of location details

Location has latitude and longitude. I am looking for a single value that will capture the variance of the location details (not separate variance for latitude and longitude). What is the best way to achieve that?

>>> pdf = pd.DataFrame({'latitude': {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
...                     'longitude': {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0}})
>>> pdf
    latitude  longitude
0       47.0       29.0
8       54.0       10.0
14      55.0       36.0
15      39.0       -9.0
2       31.0      121.0

According to the NumPy documentation, `np.var` either flattens the array and computes a single variance over all values, or computes the variance along a given axis (for example, one value per column).

`axis`: None or int or tuple of ints, optional. Axis or axes along which the variance is computed. The default is to compute the variance of the flattened array.
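To make the two behaviours concrete, here is a minimal sketch on the same data. The variable names `coords`, `flattened_var` and `per_column_var` are only illustrative. Neither result is the single "combined" value asked about here: the first pools latitudes and longitudes as if they were one variable, and the second keeps them separate.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "latitude": {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
    "longitude": {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0},
})
coords = pdf[["latitude", "longitude"]].to_numpy()  # shape (5, 2)

# Default (axis=None): all ten numbers are pooled into one variance.
flattened_var = np.var(coords)

# axis=0: one variance per column, i.e. latitude and longitude separately.
per_column_var = np.var(coords, axis=0)

print(flattened_var)
print(per_column_var)
```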

Expected (just an example)

>>> variance(pdf)
27.9

I would like to understand if the coordinates are close to each other. What is the best possible approach to get a "combined variance"?

Are you looking for the variance of each column? You could use `pdf.describe()` to get the standard deviation. — blackraven, Aug 30 '22 at 18:36
No, not covariance either. Just to understand if the coordinates are close to each other. — s510, Aug 30 '22 at 18:44
I am not sure if that will be the best possible approach because latitudes and longitudes are quite different values. — s510, Aug 30 '22 at 18:47
How about this? Create a new feature by multiplying lat*long, then find the variance of that feature. — blackraven, Aug 30 '22 at 18:56
You could also combine the variances of each dimension, something along the lines of what's described [here](https://glenbambrick.com/category/spatial-analysis/measuring-geographic-distributions/). In any case, you have to figure out how you will be able to interpret your results. — AlexK, Aug 30 '22 at 19:24
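One way to combine the per-dimension variances, as the last comment suggests, is to add them and take the square root, a quantity sometimes called the standard distance in spatial statistics. The sketch below is only an illustration under simplifying assumptions: the helper name `standard_distance` is made up here, the result is in degrees, and latitude/longitude are treated as flat x/y coordinates, which ignores the earth's curvature and the fact that a degree of longitude covers less ground at higher latitudes.

```python
import numpy as np
import pandas as pd


def standard_distance(df, lat_col="latitude", lon_col="longitude"):
    """Square root of the summed per-axis population variances (in degrees).

    Hypothetical helper: treats lat/long as planar coordinates.
    """
    lat_var = df[lat_col].var(ddof=0)  # population variance of latitude
    lon_var = df[lon_col].var(ddof=0)  # population variance of longitude
    return float(np.sqrt(lat_var + lon_var))


pdf = pd.DataFrame({
    "latitude": {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
    "longitude": {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0},
})

spread = standard_distance(pdf)
print(spread)
```

Because both variances are computed around the column means, shifting every point by the same offset leaves the value unchanged, so the number can be compared across groups of points in different places.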

blackraven · Accepted Answer · 2022-08-30T19:42:39.493

If I understood you correctly, you're looking for a score that describes how close together a group of coordinates is. The higher the score, the further apart the coordinates are spread.

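You could create a new feature by multiplying the mean-centered longitude and latitude (subtracting each column's mean makes the score independent of where the group is located), then use the variance of this new feature as the score to compare different groups of coordinates. Let me illustrate with an example: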

import matplotlib.pyplot as plt
import pandas as pd

#these points are closer together
df1 = pd.DataFrame({'latitude': {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
                   'longitude': {0: 54.0, 8: 55.0, 14: 39.0, 15: 31.0, 2: 47.0} })
df1['new'] = (df1['latitude']-df1['latitude'].mean()).mul(df1['longitude']-df1['longitude'].mean()).div(100)
score = df1['new'].var()
df1.plot(kind='scatter', x='longitude', y='latitude')

Output score 0.4407372

#these points have the same spread, but at a different location
df2 = pd.DataFrame({'latitude': {0: 147.0, 8: 154.0, 14: 155.0, 15: 139.0, 2: 131.0},
                   'longitude': {0: 154.0, 8: 155.0, 14: 139.0, 15: 131.0, 2: 147.0} })
df2['new'] = (df2['latitude']-df2['latitude'].mean()).mul(df2['longitude']-df2['longitude'].mean()).div(100)
score = df2['new'].var()
df2.plot(kind='scatter', x='longitude', y='latitude')

Output score 0.4407372

#these points are further apart
df3 = pd.DataFrame({'latitude': {0: 14.0, 8: 15.0, 14: 155.0, 15: 13.0, 2: 131.0},
                   'longitude': {0: 15.0, 8: 215.0, 14: 39.0, 15: 131.0, 2: 147.0} })
df3['new'] = (df3['latitude']-df3['latitude'].mean()).mul(df3['longitude']-df3['longitude'].mean()).div(100)
score = df3['new'].var()
df3.plot(kind='scatter', x='longitude', y='latitude')

Output score 2332.5498432

I like your solution, thank you. But ideally I would want both cases to return the same output score, because technically speaking the variance is the same in both cases. Maybe subtracting the mean before computing this would lead to the same value... — s510, Aug 30 '22 at 19:26
Yes you're right! I've added more info, so this should be correct now, thanks for pointing out my careless mistake! — blackraven, Aug 30 '22 at 19:44
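The repeated lines in the answer can be wrapped in a small helper, which also makes the point raised in the comments easy to check: once the means are subtracted, moving the whole group does not change the score. This is only a sketch. The function name `centered_product_score` is made up for illustration, and `shifted` reproduces the answer's second data set by adding 100 to every value of the first.

```python
import pandas as pd


def centered_product_score(df, lat_col="latitude", lon_col="longitude"):
    """Variance of (lat - mean(lat)) * (lon - mean(lon)) / 100, as in the answer."""
    lat_dev = df[lat_col] - df[lat_col].mean()
    lon_dev = df[lon_col] - df[lon_col].mean()
    return lat_dev.mul(lon_dev).div(100).var()


df1 = pd.DataFrame({
    "latitude": {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
    "longitude": {0: 54.0, 8: 55.0, 14: 39.0, 15: 31.0, 2: 47.0},
})
shifted = df1 + 100  # same shape of point cloud, different location

score_original = centered_product_score(df1)
score_shifted = centered_product_score(shifted)
print(score_original, score_shifted)
```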

ivanp · Answer 2 · 2022-09-01T15:05:14.267

A single spread measure, obtained by converting lat/long to Cartesian coordinates (from recipe).

import pandas as pd
import numpy as np

pdf = pd.DataFrame(
    {
        "latitude": {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
        "longitude": {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0},
    }
)

# Lat long is here interpreted as points on a sphere.
# We want to find average distance between all the points and the center of the points.
# To do this we project the spherical coordinates to cartesian coordinates.
def get_cartesian(latlon):
    lat, lon = latlon
    lat, lon = np.deg2rad(lat), np.deg2rad(lon)
    R = 6371  # mean radius of the earth in km
    x = R * np.cos(lat) * np.cos(lon)
    y = R * np.cos(lat) * np.sin(lon)
    z = R * np.sin(lat)

    return [x, y, z]


def dist_to_center(coords, center):
    return np.linalg.norm(np.array(coords) - np.array(center))


pdf = pdf.assign(
    latlong=pd.Series([x for x in zip(pdf.latitude.values, pdf.longitude.values)], index=pdf.index),
    cartesian=lambda x: x["latlong"].apply(get_cartesian),
    # split out cartesian coordinates
    x=lambda c: c["cartesian"].apply(lambda x: x[0]),
    y=lambda c: c["cartesian"].apply(lambda x: x[1]),
    z=lambda c: c["cartesian"].apply(
        lambda x: x[2],
    ),
    # calculate center point
    center_x=lambda cn: cn["x"].mean(),
    center_y=lambda cn: cn["y"].mean(),
    center_z=lambda cn: cn["z"].mean(),
    center_coord=lambda x: x[["center_x", "center_y", "center_z"]].apply(lambda row: row.tolist(), axis=1),
    # calculate the individual points' distance from the center point
    variance_from_center=lambda x: x.apply(lambda x: dist_to_center(x["cartesian"], x["center_coord"]), axis=1),
)

# get single mean for all the points' distance from the center defined by the points' mean position
variance = pdf["variance_from_center"].mean()

result:

2754.22

Hi @ivanp, thank you for this. As far as I understand, you created 3D coordinates from the 2D coordinates and then took the mean of each separately. But I fail to understand how this measures the variance. What's the thinking behind it? — s510, Sep 01 '22 at 08:37
My thinking was: lat long could be seen as describing a point on a sphere (or ellipsoid according to WGS84, I'm no specialist!); the conversion above assumes a sphere. If we plot the lat long points on a cartesian grid and find the center we're hitting a location 'inside' the sphere instead of on the surface of it. To avoid this we're projecting the lat long coordinates into a cartesian space and then doing our calculation there. — ivanp, Sep 01 '22 at 08:55
Yes, I get the cartesian part. What I find difficult to understand is how the sum of means contributes to the variance. — s510, Sep 01 '22 at 09:01
I've added to the example above - possibly a bit verbose - to explain a bit more. My interpretation of a single variance measure is: on average, how far away are all points from the middle of the shape described by all the points. — ivanp, Sep 01 '22 at 15:13
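The interpretation in the last comment (the average distance of the points from their common center) can also be written with plain NumPy arrays instead of per-row `apply` calls. The sketch below assumes the same spherical earth with a radius of 6371 km as the answer. The function name `mean_distance_to_centroid` is illustrative, not part of any library. Note that the result is a mean straight-line distance in km rather than a variance in the statistical sense, and that the center it measures from generally lies inside the sphere, not on its surface.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # spherical-earth assumption, as in the answer


def mean_distance_to_centroid(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Mean straight-line distance (km) from each point to the 3D centroid."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    lon = np.deg2rad(np.asarray(lon_deg, dtype=float))
    # One row per point, columns are the Cartesian x, y, z coordinates.
    xyz = radius * np.column_stack([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])
    centroid = xyz.mean(axis=0)
    return float(np.linalg.norm(xyz - centroid, axis=1).mean())


pdf = pd.DataFrame({
    "latitude": {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
    "longitude": {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0},
})

spread_km = mean_distance_to_centroid(pdf["latitude"], pdf["longitude"])
print(spread_km)
```

A quick sanity check is two points on opposite sides of the equator, (0, 0) and (0, 180): their centroid is the center of the earth, so the function returns the radius itself.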
