
Problem/Question

I'd like to filter/reduce trajectory data, keeping only the first occurrence within each square of a given size.

Below is an example which kind of works, but it is too slow for the data size I am looking at (~100,000 ids and many more time steps).

So my question is: how can I do this efficiently, and perhaps somewhat shorter and cleaner?

Code example

In this example, only one row is dropped by the filter (see the very last step, id 1). Also, the formatting of the final result is... awkward.

Generating dummy data

import numpy as np
import pandas as pd

def time(t):
    # t days after a fixed reference timestamp
    return pd.Timestamp("2019-01-01T12") + pd.to_timedelta(t, "d")


# MultiIndex of (id, time): three ids with two positions each
arrays = [
    np.array([1, 1, 2, 2, 3, 3]),
    np.array([time(0), time(1), time(396), time(365), time(31), time(365)]),
]
df = pd.DataFrame(
    np.array([[1.1, 0.9, 2.1, 3.2, 5.5, 4.5], [1.2, 1.0, 2.0, 3.0, 4.5, 5.5]]).T,
    index=arrays,
    columns=["lon", "lat"],
)
df.index.names = ["id", "time"]
df

returns

        lon     lat
id  time        
1   2019-01-01 12:00:00     1.1     1.2
    2019-01-02 12:00:00     0.9     1.0
2   2020-02-01 12:00:00     2.1     2.0
    2020-01-01 12:00:00     3.2     3.0
3   2019-02-01 12:00:00     5.5     4.5
    2020-01-01 12:00:00     4.5     5.5

Prepare binning

# Round lon/lat to the nearest integer, i.e. bin positions into 1x1 squares
df = df.assign(lonbin=np.round(df.lon, decimals=0))
df = df.assign(latbin=np.round(df.lat, decimals=0))
df

returns

        lon     lat     lonbin  latbin
id  time                
1   2019-01-01 12:00:00     1.1     1.2     1.0     1.0
    2019-01-02 12:00:00     0.9     1.0     1.0     1.0
2   2020-02-01 12:00:00     2.1     2.0     2.0     2.0
    2020-01-01 12:00:00     3.2     3.0     3.0     3.0
3   2019-02-01 12:00:00     5.5     4.5     6.0     4.0
    2020-01-01 12:00:00     4.5     5.5     4.0     6.0
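As an aside, rounding to zero decimals fixes the square size at 1x1. If the square size should actually be configurable, the binning could instead be done by flooring onto a grid; a rough, untested sketch, where binsize is a made-up parameter and not part of the code above:

binsize = 0.5  # made-up square size; flooring snaps each point to the lower-left corner of its square
df = df.assign(
    lonbin=np.floor(df.lon / binsize) * binsize,
    latbin=np.floor(df.lat / binsize) * binsize,
)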

Filtering for first occurrence of lat-lon combination

# For each id, keep one row per (lonbin, latbin) square,
# growing x by repeated concatenation inside the loop (the slow part)
x = pd.DataFrame()
for idx in set(df.index.get_level_values("id")):
    x = pd.concat(
        [
            x,
            (
                df.query(f"id == '{idx}'")
                .groupby(["lonbin", "latbin"])
                .idxmin()
                .reset_index()
                .assign(id=idx)
                .set_index("id")
            ),
        ]
    )
x

returns

    lonbin  latbin  lon     lat
id              
1   1.0     1.0     (1, 2019-01-02 12:00:00)    (1, 2019-01-02 12:00:00)
2   2.0     2.0     (2, 2020-02-01 12:00:00)    (2, 2020-02-01 12:00:00)
2   3.0     3.0     (2, 2020-01-01 12:00:00)    (2, 2020-01-01 12:00:00)
3   4.0     6.0     (3, 2020-01-01 12:00:00)    (3, 2020-01-01 12:00:00)
3   6.0     4.0     (3, 2019-02-01 12:00:00)    (3, 2019-02-01 12:00:00)

With (e.g.) sns.jointplot(x=x.lonbin, y=x.latbin, kind="hex") I could then look at a binned plot (or maybe calculate densities with scipy.stats.kde.gaussian_kde).
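For reference, a rough sketch of those two follow-up steps (just what I would try, not verified beyond this toy data):

import seaborn as sns
from scipy.stats import gaussian_kde

# hex-binned joint plot of the reduced positions
sns.jointplot(x=x.lonbin, y=x.latbin, kind="hex")

# 2-D kernel density estimate over the same points
kde = gaussian_kde(np.vstack([x.lonbin.to_numpy(), x.latbin.to_numpy()]))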

E.Eisbrenner
    you don't need the loop and the concat, your output can be obtained directly with `x = df.groupby(['id', "lonbin", "latbin"]).idxmin().reset_index(level=["lonbin", "latbin"])`, definitely faster than using concat in each loop – Ben.T Jan 15 '21 at 18:12
  • Repeated concat is slow, use concat once at the end of a loop instead: https://stackoverflow.com/a/28670223/530160 – Nick ODell Jan 15 '21 at 22:21
  • if I had another column, with some data like `Age`, how could I calculate for example an average in the constraints of the binning? – E.Eisbrenner Jan 20 '21 at 14:33
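Following up on Nick ODell's comment, a minimal sketch of the same loop with a single concat at the end (the per-id logic is unchanged from the question):

frames = []
for idx in set(df.index.get_level_values("id")):
    frames.append(
        df.query(f"id == '{idx}'")
        .groupby(["lonbin", "latbin"])
        .idxmin()
        .reset_index()
        .assign(id=idx)
        .set_index("id")
    )
# concatenate once instead of growing x on every iteration
x = pd.concat(frames)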

1 Answer


As noted in the comment by @Ben.T, using groupby with the other two columns simplifies the code, and the speed is then just determined by the groupby, i.e.

x = (
    df.groupby(["id", "lonbin", "latbin"])
    .idxmin()
    .reset_index(level=["lonbin", "latbin"])
)
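Regarding the follow-up comment about averaging another column within the same bins: assuming a hypothetical age column existed in df, the same groupby could be reused, e.g.

# sketch only: "age" is a hypothetical extra column, not part of the example data
avg_age = df.groupby(["id", "lonbin", "latbin"])["age"].mean()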
E.Eisbrenner