I've a dataset with ~150K entries of GPS coordinates that looks like this:
log_time latitude longitude
0 1.555840e+09 45.429597 11.974981
1 1.555869e+09 45.429597 11.974981
3 1.555869e+09 45.429596 11.974984
4 1.555869e+09 45.429490 11.975089
5 1.555869e+09 45.429092 11.975478
count 147538
mean 0 days 00:02:27.234798
std 0 days 02:34:54.243149
min 0 days 00:00:00
25% 0 days 00:00:03
50% 0 days 00:00:05
75% 0 days 00:00:08
max 39 days 12:25:39.551000
Name: log_time, dtype: object
I expect such Dataframe to scale to several million records in the near future, so scalability is a priority.
I'd like to interpolate the movements such that for the wider gaps, there's at least a GPS record every 60 seconds.
The standard method would be:
dff = dff.set_index(dff.pop('log_time'))
dff = dff.reindex(np.arange(dff.index.min(), dff.index.max()+1))
Which produces:
latitude longitude
log_time
1.555840e+09 45.429597 11.974981
1.555840e+09 NaN NaN
1.555840e+09 NaN NaN
1.555840e+09 NaN NaN
1.555840e+09 NaN NaN
That is to be interpolated with something like dff.interpolate().reset_index()
.
However I have a huge issue: none of the interpolation functions provided by scipy (and thus by pandas) is suitable for GPS distances—that are arcs, not straight lines. From what I've seen there's no easy way to extend the interpolation function though
I already have the distance function I'd use, but I see no easy way to deploy it without having to resort to nested for-loops.
from geographiclib.geodesic import Geodesic
geod = Geodesic.WGS84
def custom_interpolation(starting_value, ending_value, number_of_missing_values):
filled_array = [starting_value]
# 1. create a line between starting_value and ending_value
# by solving the inverse geodesic problem
line = geod.InverseLine(starting_value.lat, starting_value.lon, ending_value.lat, ending_value.long)
# 2. Determine the length of the steps needed to fill
# the missing values between the two extremes;
# s13 is the total arc length of the line
step_lenght = line.s13 / number_of_missing_values
# 3. Add mid values between the two arrays
for i in range(1, n + 1):
distance = min(step_lenght * i, line.s13)
g = line.Position(distance, Geodesic.STANDARD | Geodesic.LONG_UNROLL)
filled_array.append(g['lat2'], g['lon2'])
filled_array.append(ending_value)
return filled_array
So that something like [(LAT1, LON1), None, None, None, (LAT2, LON2)]
can become [(LAT1, LON1), (LAT, LON), (LAT, LON), (LAT2, LON2)]
.