
I have a relatively large (~300 MB) set of geolocation data, where the format is

Timestamp, id, type, X, Y

With the following data types:

In[7]: df.dtypes
Out[7]: 
Timestamp    datetime64[ns]
id                    int64
type                 object
X                     int64
Y                     int64
dtype: object

Each id corresponds to a particular user, and each person has several hundred points recorded across the day.

I want to create a plot showing where everyone is at a given second, so I need one point for every id. However, the data is somewhat sparse, and it's unlikely that any data point falls exactly on that second. I want to approximate the position by interpolating between the two closest points.

Between data points, I'm assuming people move linearly, so that if we know the location at 8:31:10 and 8:31:50, then at 8:31:30 they should be exactly halfway between the two locations, and at 8:31:11 they should be 1/40th of the way between the points (so interpolating as described here: Pandas data frame: resample with linear interpolation)
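To make the arithmetic concrete, here is a minimal sketch of the linear interpolation described above. The function name `interpolate_point`, the date, and the coordinates are all made up for illustration; only the 8:31:10 / 8:31:50 times come from the example.

```python
import pandas as pd

def interpolate_point(t0, p0, t1, p1, t):
    """Linearly interpolate an (x, y) position at time t between two fixes."""
    t0, t1, t = pd.Timestamp(t0), pd.Timestamp(t1), pd.Timestamp(t)
    # Dividing one Timedelta by another gives a plain float fraction.
    frac = (t - t0) / (t1 - t0)
    x = p0[0] + frac * (p1[0] - p0[0])
    y = p0[1] + frac * (p1[1] - p0[1])
    return x, y

# Hypothetical fixes: (0, 0) at 08:31:10 and (40, 80) at 08:31:50.
start, end = "2015-12-01 08:31:10", "2015-12-01 08:31:50"
halfway = interpolate_point(start, (0, 0), end, (40, 80), "2015-12-01 08:31:30")
one_fortieth = interpolate_point(start, (0, 0), end, (40, 80), "2015-12-01 08:31:11")
```

With these made-up coordinates, `halfway` lands at the midpoint of the segment and `one_fortieth` lands 1/40th of the way along it.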

I'm thinking the basic process would be:

  • loop through each id:
    • filter the data for that id
    • get last location before time (e.g. the last recorded location before 8:31:11, or whatever time is used)
    • get first location after time (e.g. the first recorded location after 8:31:11, or whatever time is used)
    • interpolate to figure out where they are at that second
    • add location to list
  • plot list of each id's location
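The steps above can be sketched roughly as follows, without resampling the whole day. This is only one possible approach: `positions_at` is a hypothetical helper name, the sample rows are invented, and it leans on `np.interp`, which finds the bracketing points and interpolates in one call. Note that `np.interp` returns the first or last recorded position when the requested time falls outside an id's recorded range, which may or may not be what you want.

```python
import numpy as np
import pandas as pd

def positions_at(df, when):
    """Return one interpolated (X, Y) row per id at the instant `when`."""
    when = pd.Timestamp(when)
    rows = []
    # Sorting first keeps each group's rows in time order.
    for uid, group in df.sort_values("Timestamp").groupby("id"):
        # Seconds relative to the target instant; the target itself is 0.0.
        offsets = (group["Timestamp"] - when).dt.total_seconds().to_numpy()
        x = np.interp(0.0, offsets, group["X"].to_numpy())
        y = np.interp(0.0, offsets, group["Y"].to_numpy())
        rows.append({"id": uid, "X": x, "Y": y})
    return pd.DataFrame(rows)

# Invented sample data in the same layout as the question's DataFrame.
sample = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2015-12-01 08:31:10", "2015-12-01 08:31:50",
        "2015-12-01 08:30:00", "2015-12-01 08:32:00",
    ]),
    "id": [1, 1, 2, 2],
    "type": ["a", "a", "b", "b"],
    "X": [0, 40, 100, 200],
    "Y": [0, 80, 0, 0],
})
result = positions_at(sample, "2015-12-01 08:31:30")
```

The resulting frame has one row per id, so it can be passed straight to a scatter plot.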

I know I can loop through each id with

for name, group in df.groupby('id'):

and plotting isn't a problem, but I'm not sure about the rest.

After a bit of searching I haven't found any good way to do this for a single value from each group. Other answers suggest using the resample and interpolate functions, but that will take way too long with the size of data I have, and does a lot of unnecessary calculations seeing as I only need one point.

Jezzamon

1 Answer


It is not quite clear what you want, but let's start with something.

First, you probably need a list of unique IDs, right?

import pandas as pd
import numpy as np

df = ...

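unids = df['id'].unique()

for uid in unids:
    # subset df by id and sort the result by Timestamp
    df_id = df[df['id'] == uid].sort_values('Timestamp')
    tmin = df_id['Timestamp'].iloc[0]
    tmax = df_id['Timestamp'].iloc[-1]
    tstep = ... # time step (a pd.Timedelta or frequency string)

    # seconds elapsed since the first record, used as the interpolation axis
    secs = (df_id['Timestamp'] - tmin).dt.total_seconds()

    position = []
    for t in pd.date_range(tmin, tmax, freq=tstep):
        # interpolate X and Y linearly at time t
        s = (t - tmin).total_seconds()
        x = np.interp(s, secs, df_id['X'])
        y = np.interp(s, secs, df_id['Y'])
        # add to position
        position.append((x, y))
    plot(position)  # placeholder for your own plotting call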

Does this look reasonable?

Severin Pappadeux
  • I added more to the question to hopefully explain it a little better. Basically, I want 1 point for each person at a certain time (e.g. 10:30:14 AM), but it's unlikely there's any data that corresponds exactly with that time. So I'm thinking I need to get the data just before 10:30:14 and just after 10:30:14 and interpolate. The data for each person spans for a whole day though. – Jezzamon Dec 01 '15 at 05:41