
With a dataframe like the one below,

Time            Lat         Long        Val
19:24:50.925    35.61068333 139.6304283 -54.6   
19:24:51.022    35.61068333 139.6304283 -52.9   
19:24:51.118    35.61068333 139.6304283 -52.6   
19:24:51.215    35.61068394 139.6304283 -52.2
19:24:51.312    35.61068455 139.6304283 -49.3
19:24:51.409    35.61068515 139.6304283 -52.1
19:24:51.506    35.61068576 139.6304283 -52.2
19:24:51.603    35.61068636 139.6304283 -51.3
19:24:51.699    35.61068697 139.6304283 -51.8
19:24:51.796    35.61068758 139.6304283 -52.6
19:24:51.892    35.61068818 139.6304283 -53.5
19:24:51.990    35.61068879 139.6304283 -51.8
19:24:52.087    35.61068939 139.6304283 -54.1
19:24:52.183    35.61069042 139.6304283 -51.8
19:24:52.281    35.61069083 139.6304283 -53.5
19:24:52.378    35.61069125 139.6304283 -55.6
19:24:52.474    35.61069222 139.6304283 -53.2
19:24:52.571    35.61069278 139.6304283 -50.8
19:24:52.668    35.61069333 139.6304283 -54

The columns Lat and Long together hold the geographic coordinates of each location, and the Val column holds the measurement of some metric at that location. What I need to do is aggregate the Val column (mean) every 0.005 meters: starting with the first location (lat/long) as a reference, check for rows falling within 0.005 meters of it, take the average of their Vals, then repeat from the next location (the first one outside the 0.005 m limit), so that the result looks like the table below. I looked at pandas.Grouper but am not sure how to use it to achieve this.

Lat Long Val Count_of_records
user3206440
    So to get started you want the pairwise Euclidean distance between all the points? `geopy.distance` should make this easy. But then I'm unclear what you want. The pairwise mean for all points within 0.005 metres of one another? Or the mean for "groups" in which all the members of the group are within 0.005 meters of one another? – jwalton Jan 30 '20 at 10:27
  • If it's the latter case [this previous question](https://stackoverflow.com/q/53075481/11021886) might help. – jwalton Jan 30 '20 at 10:37
  • @Ralph - the idea is starting from the first location as reference check for rows with location within 0.005 meters within that and average it. The result will have `Lat`, `Long` and `mean(Val)` – user3206440 Jan 30 '20 at 11:16
  • To verify I understand: you'd then continue to apply the same procedure to each row, using each row as the reference? – jwalton Jan 30 '20 at 11:30
  • In a procedural way: start with the first row as the reference; check the next row's distance from the reference row; if it is less than the limit distance { store the current row's `Val` for aggregation } else { perform the aggregation on the stored `Val`s and change the reference to the current row }; repeat – user3206440 Jan 30 '20 at 14:26
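The comments suggest `geopy.distance` for the per-row distance check. As a rough, dependency-free sketch of that check, the haversine formula gives nearly the same meter distances at this scale (the helper `haversine_m` below is hypothetical, not part of any library):

```python
import math

def haversine_m(p1, p2):
    # Great-circle distance in meters between two (lat, long) pairs;
    # a rough stand-in for geopy.distance.geodesic(...).m at short ranges.
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

ref = (35.61068333, 139.6304283)  # first row, taken as the reference
cur = (35.61068394, 139.6304283)  # the fourth row of the sample data

gap_m = haversine_m(ref, cur)     # roughly 0.07 m for these two rows
same_group = gap_m < 0.005        # False: already outside the 0.005 m limit
```

Note that in this sample the consecutive fixes are already several centimeters apart, so a 0.005 m limit would put almost every distinct coordinate in its own group.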

1 Answer


Apologies, I'm still having a little trouble understanding the question. Hopefully this is it.

Sure, this solution is a little verbose, but I think it makes the logic clearer and easier to maintain in the future.

import pandas as pd
from io import StringIO
import geopy.distance
import numpy as np

# Setup data as in MWE
df = pd.read_fwf(StringIO("""
    Time            Lat         Long        Val
19:24:50.925    35.61068333 139.6304283 -54.6   
19:24:51.022    35.61068333 139.6304283 -52.9   
19:24:51.118    35.61068333 139.6304283 -52.6   
19:24:51.215    35.61068394 139.6304283 -52.2
19:24:51.312    35.61068455 139.6304283 -49.3
19:24:51.409    35.61068515 139.6304283 -52.1
19:24:51.506    35.61068576 139.6304283 -52.2
19:24:51.603    35.61068636 139.6304283 -51.3
19:24:51.699    35.61068697 139.6304283 -51.8
19:24:51.796    35.61068758 139.6304283 -52.6
19:24:51.892    35.61068818 139.6304283 -53.5
19:24:51.990    35.61068879 139.6304283 -51.8
19:24:52.087    35.61068939 139.6304283 -54.1
19:24:52.183    35.61069042 139.6304283 -51.8
19:24:52.281    35.61069083 139.6304283 -53.5
19:24:52.378    35.61069125 139.6304283 -55.6
19:24:52.474    35.61069222 139.6304283 -53.2
19:24:52.571    35.61069278 139.6304283 -50.8
19:24:52.668    35.61069333 139.6304283 -54"""), header=1)

# Extract longitude and latitude from df
coords = df[['Lat', 'Long']].values
# Compute the distances between consecutive rows of the dataframe
consec_dist = [geopy.distance.geodesic(*i).m for i in zip(coords[:-1], coords[1:])]

# Set up column in which to store our aggregates
df['mean'] = np.zeros(df.shape[0])

# The threshold distance
d = 0.005

# Loop over the rows one at a time
for row in range(df.shape[0] - 1):

    # From comments:
    # if less than limit distance { store Val of current row for aggregation}
    # else { perform aggregation on stored Vals and change reference to current row } repeat

    if consec_dist[row] < d:
        df.loc[row, 'mean'] = df.loc[row, 'Val']
    else:
        df.loc[row, 'mean'] = df.loc[row:row + 1, 'Val'].mean()

This gave me the following:

In [2]: df
Out[2]:
            Time        Lat        Long   Val   mean
0   19:24:50.925  35.610683  139.630428 -54.6 -54.60
1   19:24:51.022  35.610683  139.630428 -52.9 -52.90
2   19:24:51.118  35.610683  139.630428 -52.6 -52.40
3   19:24:51.215  35.610684  139.630428 -52.2 -50.75
4   19:24:51.312  35.610685  139.630428 -49.3 -50.70
5   19:24:51.409  35.610685  139.630428 -52.1 -52.15
6   19:24:51.506  35.610686  139.630428 -52.2 -51.75
7   19:24:51.603  35.610686  139.630428 -51.3 -51.55
8   19:24:51.699  35.610687  139.630428 -51.8 -52.20
9   19:24:51.796  35.610688  139.630428 -52.6 -53.05
10  19:24:51.892  35.610688  139.630428 -53.5 -52.65
11  19:24:51.990  35.610689  139.630428 -51.8 -52.95
12  19:24:52.087  35.610689  139.630428 -54.1 -52.95
13  19:24:52.183  35.610690  139.630428 -51.8 -52.65
14  19:24:52.281  35.610691  139.630428 -53.5 -54.55
15  19:24:52.378  35.610691  139.630428 -55.6 -54.40
16  19:24:52.474  35.610692  139.630428 -53.2 -52.00
17  19:24:52.571  35.610693  139.630428 -50.8 -52.40
18  19:24:52.668  35.610693  139.630428 -54.0   0.00
jwalton