
I'm trying to bin (downsample) a time series based on its timestamps. For instance:

import numpy as np
import pandas as pd

timestamps = np.linspace(0, 1000, 10000)
values = np.random.random(10000)

I usually convert it to a dataframe, and use cut (or qcut) to create the bins:

timeseries_df = pd.DataFrame({"Timestamps": timestamps, "Values": values})
timeseries_df["Bins"] = pd.cut(timeseries_df["Timestamps"],100) #downsampling by two orders of magnitude
ds_timestamps = timeseries_df.groupby("Bins").max()["Timestamps"]
ds_values = timeseries_df.groupby("Bins").mean()["Values"]

This works, but I'm writing functions that I can reuse, and I'd like to avoid depending on pandas if possible. I've tried implementing a version of what's been suggested here:

ds_timestamps = np.linspace(timestamps.min(), timestamps.max(), 100)
digitized_timestamps = np.digitize(timestamps, ds_timestamps)
ds_values = [values[digitized_timestamps == i+1].mean() for i in range(len(ds_timestamps))]

This also works but is extremely slow. Is there another way of doing this?

  • Why not use pandas? It's bound to be really, really fast, because internally it uses ultra-fast C code, so it's almost as fast as you'll ever get. –  Nov 05 '21 at 22:16
  • I assumed that there must be an implementation in numpy that has a better performance than pandas. But if that is not the case, I might go with pandas instead. Thanks! – Alex Legaria Nov 05 '21 at 22:25

1 Answer


As mentioned in the comments, if your primary concern about using Pandas is speed, I'd actually recommend sticking with it: it isn't written entirely in Python, and many of its internals are implemented in Cython (essentially C), so the groupby/aggregation machinery is very fast.
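
For reference, here's a minimal sketch of how you could wrap the approach from your question into a reusable helper (the function name and signature are my own; it also does both aggregations in a single groupby pass instead of two):

import numpy as np
import pandas as pd

def downsample(timestamps, values, n_bins=100):
    # Bin by timestamp, then take the max timestamp and mean value per bin.
    df = pd.DataFrame({"Timestamps": timestamps, "Values": values})
    bins = pd.cut(df["Timestamps"], n_bins)
    grouped = df.groupby(bins).agg({"Timestamps": "max", "Values": "mean"})
    return grouped["Timestamps"].to_numpy(), grouped["Values"].to_numpy()

# Example usage with the data from the question:
timestamps = np.linspace(0, 1000, 10000)
values = np.random.random(10000)
ds_timestamps, ds_values = downsample(timestamps, values)

Keeping the pandas dependency inside one small function like this means the rest of your code only ever sees plain NumPy arrays.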