
How to calculate the variance of location details

Each location has a latitude and a longitude. I am looking for a single value that captures the variance of the locations as a whole (not separate variances for latitude and longitude). What is the best way to achieve that?

>>> pdf = pd.DataFrame({'latitude': {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
              'longitude': {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0}
             })

>>> pdf
    latitude  longitude
0       47.0       29.0
8       54.0       10.0
14      55.0       36.0
15      39.0       -9.0
2       31.0      121.0

According to the NumPy documentation, np.var either flattens the array and computes one variance over all values, or computes the variance along a given axis (for example, per column):

axis : None or int or tuple of ints, optional. Axis or axes along which the variance is computed. The default is to compute the variance of the flattened array.
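For reference, a minimal sketch of those two behaviours on the data above. The variable names `values`, `flattened_var` and `per_column_var` are only illustrative. Neither result treats a row as a single (latitude, longitude) point, which is the gap the question is about.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "latitude": {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
    "longitude": {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0},
})

values = pdf[["latitude", "longitude"]].to_numpy()

# axis=None (the default): all 10 numbers are pooled into one flat array,
# so latitudes and longitudes are mixed together.
flattened_var = np.var(values)

# axis=0: one variance per column, i.e. [var(latitude), var(longitude)].
per_column_var = np.var(values, axis=0)

print(flattened_var)
print(per_column_var)
```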

Expected (just an example)

>>> variance(pdf)
27.9

I would like to understand if the coordinates are close to each other. What is the best possible approach to get a "combined variance"?

blackraven
s510

2 Answers


If I understood you correctly, you're looking for a score that describes how close a group of coordinates are to each other: the higher the score, the further apart the coordinates are spread.

You could create a new feature by multiplying the mean-centred latitude by the mean-centred longitude (scaled down by 100 here), then use the variance of this new feature as the score to compare different groups of coordinates. Let me illustrate with an example:

import matplotlib.pyplot as plt
import pandas as pd

#these points are closer together
df1 = pd.DataFrame({'latitude': {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
                   'longitude': {0: 54.0, 8: 55.0, 14: 39.0, 15: 31.0, 2: 47.0} })
df1['new'] = (df1['latitude']-df1['latitude'].mean()).mul(df1['longitude']-df1['longitude'].mean()).div(100)
score = df1['new'].var()
df1.plot(kind='scatter', x='longitude', y='latitude')

Output score 0.4407372


#these points have the same spread, but at a different location
df2 = pd.DataFrame({'latitude': {0: 147.0, 8: 154.0, 14: 155.0, 15: 139.0, 2: 131.0},
                   'longitude': {0: 154.0, 8: 155.0, 14: 139.0, 15: 131.0, 2: 147.0} })
df2['new'] = (df2['latitude']-df2['latitude'].mean()).mul(df2['longitude']-df2['longitude'].mean()).div(100)
score = df2['new'].var()
df2.plot(kind='scatter', x='longitude', y='latitude')

Output score 0.4407372


#these points are further apart
df3 = pd.DataFrame({'latitude': {0: 14.0, 8: 15.0, 14: 155.0, 15: 13.0, 2: 131.0},
                   'longitude': {0: 15.0, 8: 215.0, 14: 39.0, 15: 131.0, 2: 147.0} })
df3['new'] = (df3['latitude']-df3['latitude'].mean()).mul(df3['longitude']-df3['longitude'].mean()).div(100)
score = df3['new'].var()
df3.plot(kind='scatter', x='longitude', y='latitude')

Output score 2332.5498432

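The three examples repeat the same computation, so it can be wrapped in a small helper. This is only a sketch: the function name `spread_score` is hypothetical and not part of pandas, and the body simply mirrors the lines used above (mean-centre both columns, multiply, divide by 100, take the sample variance).

```python
import pandas as pd


def spread_score(df):
    """Variance of the scaled product of mean-centred latitude and longitude.

    Hypothetical helper that mirrors the per-DataFrame computation above.
    """
    lat_dev = df["latitude"] - df["latitude"].mean()
    lon_dev = df["longitude"] - df["longitude"].mean()
    return lat_dev.mul(lon_dev).div(100).var()


df1 = pd.DataFrame({"latitude": {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
                    "longitude": {0: 54.0, 8: 55.0, 14: 39.0, 15: 31.0, 2: 47.0}})
# df2 is df1 shifted by 100 in both columns: same spread, different location
df2 = df1 + 100
df3 = pd.DataFrame({"latitude": {0: 14.0, 8: 15.0, 14: 155.0, 15: 13.0, 2: 131.0},
                    "longitude": {0: 15.0, 8: 215.0, 14: 39.0, 15: 131.0, 2: 147.0}})

score1, score2, score3 = spread_score(df1), spread_score(df2), spread_score(df3)
print(score1, score2, score3)
```

Because the means are subtracted first, shifting every point by the same offset leaves the score unchanged, which is why `df1` and `df2` agree.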

blackraven
  • I like your solution, thank you. But ideally I would want both cases to return the same output score, because technically speaking the variance is the same in both cases. Maybe subtracting the mean before computing this would lead to the same value... – s510 Aug 30 '22 at 19:26
  • Yes, you're right! I've added more info, so this should be correct now. Thanks for pointing out my careless mistake! – blackraven Aug 30 '22 at 19:44

A single spread measure, obtained by converting lat/long to Cartesian coordinates (conversion taken from a recipe).

import pandas as pd
import numpy as np

pdf = pd.DataFrame(
    {
        "latitude": {0: 47.0, 8: 54.0, 14: 55.0, 15: 39.0, 2: 31.0},
        "longitude": {0: 29.0, 8: 10.0, 14: 36.0, 15: -9.0, 2: 121.0},
    }
)

# Lat long is here interpreted as points on a sphere.
# We want to find average distance between all the points and the center of the points.
# To do this we project the spherical coordinates to cartesian coordinates.
def get_cartesian(latlon):
    lat, lon = latlon
    lat, lon = np.deg2rad(lat), np.deg2rad(lon)
    R = 6371  # mean radius of the Earth in km (spherical approximation)
    x = R * np.cos(lat) * np.cos(lon)
    y = R * np.cos(lat) * np.sin(lon)
    z = R * np.sin(lat)

    return [x, y, z]


def dist_to_center(coords, center):
    return np.linalg.norm(np.array(coords) - np.array(center))


pdf = pdf.assign(
    latlong=pd.Series([x for x in zip(pdf.latitude.values, pdf.longitude.values)], index=pdf.index),
    cartesian=lambda x: x["latlong"].apply(get_cartesian),
    # split out cartesian coordinates
    x=lambda c: c["cartesian"].apply(lambda x: x[0]),
    y=lambda c: c["cartesian"].apply(lambda x: x[1]),
    z=lambda c: c["cartesian"].apply(
        lambda x: x[2],
    ),
    # calculate center point
    center_x=lambda cn: cn["x"].mean(),
    center_y=lambda cn: cn["y"].mean(),
    center_z=lambda cn: cn["z"].mean(),
    # use .tolist() rather than x[0], x[1], x[2]: integer keys on a label-indexed row are deprecated
    center_coord=lambda x: x[["center_x", "center_y", "center_z"]].apply(lambda row: row.tolist(), axis=1),
    # calculate the individual points' distance from the center point
    variance_from_center=lambda x: x.apply(lambda x: dist_to_center(x["cartesian"], x["center_coord"]), axis=1),
)

# get single mean for all the points' distance from the center defined by the points' mean position
variance = pdf["variance_from_center"].mean()

Result (mean distance from the center, in km, since R is given in km):

2754.22
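The same quantity can also be computed without intermediate DataFrame columns. The following is a vectorised sketch under the same spherical-Earth assumption; the function name `mean_distance_to_centroid` and the constant `EARTH_RADIUS_KM` are illustrative names, not library APIs.

```python
import numpy as np

EARTH_RADIUS_KM = 6371  # same spherical approximation as above


def mean_distance_to_centroid(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Mean straight-line (chord) distance from each point to the 3D centroid."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    lon = np.deg2rad(np.asarray(lon_deg, dtype=float))
    # one row per point, columns are x, y, z
    xyz = radius * np.column_stack([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])
    centroid = xyz.mean(axis=0)
    return np.linalg.norm(xyz - centroid, axis=1).mean()


lat = [47.0, 54.0, 55.0, 39.0, 31.0]
lon = [29.0, 10.0, 36.0, -9.0, 121.0]
spread_km = mean_distance_to_centroid(lat, lon)
print(spread_km)
```

Note that the centroid generally lies inside the sphere, so the distances are straight-line distances through the Earth rather than distances along the surface.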

ivanp
  • Hi @ivanp, thank you for this. As far as I understand, you created 3D coordinates from the 2D coordinates and then took the mean of each separately. But I fail to understand how this measures the variance. What's the thinking behind it? – s510 Sep 01 '22 at 08:37
  • My thinking was: lat/long can be seen as describing a point on a sphere (or an ellipsoid according to WGS84, I'm no specialist!); the conversion above assumes a sphere. If we plot the lat/long points on a Cartesian grid and find the center, we hit a location 'inside' the sphere instead of on its surface. To avoid this we project the lat/long coordinates into a Cartesian space and do our calculation there. – ivanp Sep 01 '22 at 08:55
  • Yes, I get the Cartesian part. What I find difficult to understand is how the sum of means contributes to the variance. – s510 Sep 01 '22 at 09:01
  • I've added to the example above - possibly a bit verbose - to explain a bit more. My interpretation of a single variance measure is: on average, how far away are all points from the middle of the shape described by all the points. – ivanp Sep 01 '22 at 15:13