2

I've a dataset with ~150K entries of GPS coordinates that looks like this:

    log_time    latitude    longitude
0   1.555840e+09    45.429597   11.974981
1   1.555869e+09    45.429597   11.974981
3   1.555869e+09    45.429596   11.974984
4   1.555869e+09    45.429490   11.975089
5   1.555869e+09    45.429092   11.975478
count                     147538
mean      0 days 00:02:27.234798
std       0 days 02:34:54.243149
min              0 days 00:00:00
25%              0 days 00:00:03
50%              0 days 00:00:05
75%              0 days 00:00:08
max      39 days 12:25:39.551000
Name: log_time, dtype: object

I expect such Dataframe to scale to several million records in the near future, so scalability is a priority.

I'd like to interpolate the movements such that for the wider gaps, there's at least a GPS record every 60 seconds.

The standard method would be:

dff = dff.set_index(dff.pop('log_time'))
dff = dff.reindex(np.arange(dff.index.min(), dff.index.max()+1))

Which produces:

latitude    longitude
log_time        
1.555840e+09    45.429597   11.974981
1.555840e+09    NaN NaN
1.555840e+09    NaN NaN
1.555840e+09    NaN NaN
1.555840e+09    NaN NaN

That is to be interpolated with something like dff.interpolate().reset_index().

However I have a huge issue: none of the interpolation functions provided by scipy (and thus by pandas) is suitable for GPS distances—that are arcs, not straight lines. From what I've seen there's no easy way to extend the interpolation function though

I already have the distance function I'd use, but I see no easy way to deploy it without having to resort to nested for-loops.

from geographiclib.geodesic import Geodesic
geod = Geodesic.WGS84

def custom_interpolation(starting_value, ending_value, number_of_missing_values):
    filled_array = [starting_value]

    # 1. create a line between starting_value and ending_value 
    # by solving the inverse geodesic problem
    line = geod.InverseLine(starting_value.lat, starting_value.lon, ending_value.lat, ending_value.long)

    # 2. Determine the length of the steps needed to fill 
    # the missing values between the two extremes; 
    # s13 is the total arc length of the line
    step_lenght = line.s13 / number_of_missing_values

    # 3. Add mid values between the two arrays
    for i in range(1, n + 1):
        distance = min(step_lenght * i, line.s13)
        g = line.Position(distance, Geodesic.STANDARD | Geodesic.LONG_UNROLL)
        filled_array.append(g['lat2'], g['lon2'])

    filled_array.append(ending_value)
    return filled_array

So that something like [(LAT1, LON1), None, None, None, (LAT2, LON2)] can become [(LAT1, LON1), (LAT, LON), (LAT, LON), (LAT2, LON2)].

Gian Segato
  • 2,359
  • 3
  • 29
  • 37
  • 2
    I think your question would be easier "answereable" if you would explain the logic of your interpolation, now people have to go through your code and try to extract the logic themself. – Erfan Dec 28 '19 at 22:30
  • @Erfan good point: done! – Gian Segato Dec 29 '19 at 11:33
  • In what way is the slerp interpolation in the accepted answer to the question you linked not a good solution for you? – Seb Dec 30 '19 at 02:21

1 Answers1

0

You can interpolate using nearest data points; for each point x,y find four other coordinates, and use distances as weight to average the associated data point, a time, an elevation, whatever. The code below uses euclidian which is good enough in short distances, but it can easily be changed to be a more accurate distance measure,

class QuadTreeInterpolator:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
        self.tree = quads.QuadTree((np.mean(x), np.mean(y)), np.std(x)*4, np.std(y)*4)
        for xx,yy,zz in zip(x,y,z):
            self.tree.insert((xx,yy),data=zz)

    def interp_cell(self, x, y, points):
        a = np.array([x,y]).reshape(-1,2)
        b = np.array(points)[:,:2]
        ds = cdist(a,b)
        ds = ds / np.sum(ds)
        ds = 1. - ds
        c = np.array(points)[:,2]
        iz = np.sum(c * ds) / np.sum(ds)
        return iz
            
    def interpolate(self,x,y):
        res = self.tree.nearest_neighbors((x,y), count=4)
        points = np.array([[c.x, c.y, c.data] for c in res])
        return self.interp_cell(x, y, points)
BBSysDyn
  • 4,389
  • 8
  • 48
  • 63