Problem/Question
I would like to filter/reduce trajectory data, keeping only the first occurrence per square for a given square (bin) size.
Below is an example which kind of works, but it is too slow for the data size I am looking at (~100_000 ids and many more time steps).
So my question is: how can I do this efficiently, and maybe somewhat shorter and cleaner?
Code example
In this example, only one row is dropped by the filter (see the very last step, id 1). Also, the formatting of the final result is awkward.
Generating dummy data
import numpy as np
import pandas as pd


def time(t):
    return pd.Timestamp("2019-01-01T12") + pd.to_timedelta(t, "d")


arrays = [
    np.array([1, 1, 2, 2, 3, 3]),
    np.array([time(0), time(1), time(396), time(365), time(31), time(365)]),
]

df = pd.DataFrame(
    np.array([[1.1, 0.9, 2.1, 3.2, 5.5, 4.5], [1.2, 1.0, 2.0, 3.0, 4.5, 5.5]]).T,
    index=arrays,
    columns=["lon", "lat"],
)
df.index.names = ["id", "time"]
df
returns
lon lat
id time
1 2019-01-01 12:00:00 1.1 1.2
2019-01-02 12:00:00 0.9 1.0
2 2020-02-01 12:00:00 2.1 2.0
2020-01-01 12:00:00 3.2 3.0
3 2019-02-01 12:00:00 5.5 4.5
2020-01-01 12:00:00 4.5 5.5
Prepare binning
# bin by rounding to the nearest integer, i.e. 1x1 (degree) squares
df = df.assign(lonbin=np.round(df.lon, decimals=0))
df = df.assign(latbin=np.round(df.lat, decimals=0))
df
returns
lon lat lonbin latbin
id time
1 2019-01-01 12:00:00 1.1 1.2 1.0 1.0
2019-01-02 12:00:00 0.9 1.0 1.0 1.0
2 2020-02-01 12:00:00 2.1 2.0 2.0 2.0
2020-01-01 12:00:00 3.2 3.0 3.0 3.0
3 2019-02-01 12:00:00 5.5 4.5 6.0 4.0
2020-01-01 12:00:00 4.5 5.5 4.0 6.0
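In the real data I would like to control the square size, so I assume the rounding above could be generalized roughly like this (bin_size is a hypothetical parameter I made up for illustration; bin_size = 1.0 reproduces the np.round(..., decimals=0) binning):

bin_size = 1.0  # hypothetical square size; 1.0 is equivalent to rounding to 0 decimals
df = df.assign(
    lonbin=np.round(df.lon / bin_size) * bin_size,
    latbin=np.round(df.lat / bin_size) * bin_size,
)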
Filtering for first occurrence of lat-lon combination
# for each id separately, keep the first occurrence per (lonbin, latbin) -- slow for many ids
x = pd.DataFrame()
for idx in set(df.index.get_level_values("id")):
    x = pd.concat(
        [
            x,
            (
                df.query("id == @idx")
                .groupby(["lonbin", "latbin"])
                .idxmin()
                .reset_index()
                .assign(id=idx)
                .set_index("id")
            ),
        ]
    )
x
returns
lonbin latbin lon lat
id
1 1.0 1.0 (1, 2019-01-02 12:00:00) (1, 2019-01-02 12:00:00)
2 2.0 2.0 (2, 2020-02-01 12:00:00) (2, 2020-02-01 12:00:00)
2 3.0 3.0 (2, 2020-01-01 12:00:00) (2, 2020-01-01 12:00:00)
3 4.0 6.0 (3, 2020-01-01 12:00:00) (3, 2020-01-01 12:00:00)
3 6.0 4.0 (3, 2019-02-01 12:00:00) (3, 2019-02-01 12:00:00)
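For comparison, this is roughly what I imagine a vectorized version might look like (an untested sketch, not part of my current code): sort by time within each id and keep the first row per (id, lonbin, latbin) combination.

# untested sketch: avoid the Python loop by sorting and dropping duplicates
# per (id, lonbin, latbin); keep="first" keeps the earliest time step
y = (
    df.reset_index()
    .sort_values(["id", "time"])
    .drop_duplicates(subset=["id", "lonbin", "latbin"], keep="first")
    .set_index(["id", "time"])
)

This would also keep the original lon/lat values instead of the index tuples, but I am not sure it is the intended or most efficient way.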
With (e.g.) sns.jointplot(x=x.lonbin, y=x.latbin, kind="hex") I could then look at a binned plot (or maybe calculate densities with scipy.stats.kde.gaussian_kde).
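For the density idea, a minimal sketch of what I have in mind, assuming the binned columns from above:

from scipy.stats import gaussian_kde

# sketch only: evaluate a 2D KDE at the binned points themselves
points = np.vstack([x.lonbin.to_numpy(), x.latbin.to_numpy()])
density = gaussian_kde(points)(points)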