I have a relatively large (~300 MB) set of geolocation data, where the format is
Timestamp, id, type, X, Y
With the following data types:
In[7]: df.dtypes
Out[7]:
Timestamp    datetime64[ns]
id                    int64
type                 object
X                     int64
Y                     int64
dtype: object
Each id corresponds to a particular user, and each person has several hundred points recorded across the day.
I want to create a plot showing where everyone is at a given second, so I need one point for every id. However, the data is somewhat sparse, and it's unlikely that any data point falls exactly on that second. I want to approximate the position by interpolating between the two closest points (the last one before and the first one after).
Between data points, I'm assuming people move linearly: if we know the location at 8:31:10 and at 8:31:50, then at 8:31:30 they should be exactly halfway between the two locations, and at 8:31:11 they should be 1/40th of the way from the first point to the second. This is the interpolation described here: Pandas data frame: resample with linear interpolation
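To make the arithmetic concrete, here is a minimal sketch of that interpolation for a single pair of points. The date and coordinates are made up for illustration, and `lerp_position` is just a hypothetical helper name:

```python
import pandas as pd

# Hypothetical fixes for one id; the date and coordinates are made up.
t0, x0, y0 = pd.Timestamp("2015-01-01 08:31:10"), 100, 200
t1, x1, y1 = pd.Timestamp("2015-01-01 08:31:50"), 140, 280


def lerp_position(t):
    """Linearly interpolate (x, y) at time t between the two fixes."""
    # Dividing two Timedeltas gives a plain float: the fraction of the
    # interval that has elapsed at time t (0 at t0, 1 at t1).
    frac = (t - t0) / (t1 - t0)
    return x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)


halfway = lerp_position(pd.Timestamp("2015-01-01 08:31:30"))       # frac = 1/2
one_fortieth = lerp_position(pd.Timestamp("2015-01-01 08:31:11"))  # frac = 1/40
```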
I'm thinking the basic process would be:
- loop through each id:
  - filter the data for that id
  - get the last location before the time (e.g. the last recorded location before 8:31:11, or whatever time is used)
  - get the first location after the time (e.g. the first recorded location after 8:31:11, or whatever time is used)
  - interpolate to figure out where they are at that second
  - add the location to a list
- plot the list of each id's location
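Roughly, I imagine the steps above looking something like the sketch below. It is untested on the real data, `positions_at` is a name I made up, and it assumes the column names from the dtypes listing. `np.interp` does the "last point before / first point after" lookup and the linear interpolation in one call. Skipping ids that have no point on both sides of the time is my own assumption about how to handle that case.

```python
import numpy as np
import pandas as pd


def positions_at(df, when):
    """Return one interpolated (X, Y) row per id at the timestamp `when`."""
    when = pd.Timestamp(when)
    rows = []
    # Sort once so that every group's timestamps are increasing,
    # which np.interp requires of its x-coordinates.
    for uid, group in df.sort_values("Timestamp").groupby("id"):
        # Seconds relative to the requested time, so the target is 0.0.
        secs = (group["Timestamp"] - when).dt.total_seconds().to_numpy()
        # Assumption: skip ids with no point on both sides of `when`
        # (np.interp would otherwise clamp to the nearest end point).
        if secs[0] > 0 or secs[-1] < 0:
            continue
        x = np.interp(0.0, secs, group["X"].to_numpy())
        y = np.interp(0.0, secs, group["Y"].to_numpy())
        rows.append((uid, x, y))
    return pd.DataFrame(rows, columns=["id", "X", "Y"])
```

The result of `positions_at(df, some_timestamp)` would then be a small frame with one row per id that can be passed straight to a scatter plot.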
I know I can loop through each id with
for name, group in df.groupby('id'):
and plotting isn't a problem, but I'm not sure about the rest.
After a bit of searching I haven't found any good way to do this for a single value from each group. Other answers suggest using the resample and interpolate functions, but that will take way too long with the size of data I have, and does a lot of unnecessary calculations seeing as I only need one point.