I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one of dtype datetime64[ns]. I would like to find some way to 'group' together records whose datetimes are very close to one another, by assigning them a common ID called Grouped ID. For example, I might group the following transactions together if they occur within two seconds of each other:

Transaction ID    Time    Grouped ID
    1          08:10:02       1
    2          08:10:03       1
    3          08:10:50
    4          08:10:55
    5          08:11:00       2
    6          08:11:01       2
    7          08:11:02       2
    8          08:11:03       3
    9          08:11:04       3
   10          08:15:00

Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals: once a full 2-second window has passed, a new window begins with the next transaction (as shown in transactions 5-9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combining transactions within 50 ms), but I stuck with seconds for ease of presentation above.
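For concreteness, here is a rough sketch of the assignment logic I have in mind, written against a recent pandas (the function name and details are placeholders for illustration, not code I actually have working):

import numpy as np
import pandas as pd

def assign_group_ids(times, window=pd.Timedelta(seconds=2)):
    """times: a chronologically sorted Series of datetime64[ns] values."""
    labels = np.empty(len(times), dtype=int)
    window_start = None
    label = 0
    for i, t in enumerate(times):
        # open a new window once the current one is exhausted
        if window_start is None or t - window_start > window:
            window_start = t
            label += 1
        labels[i] = label
    s = pd.Series(labels, index=times.index)
    # only windows holding more than one transaction get a Grouped ID,
    # and the surviving windows are renumbered 1, 2, 3, ...
    counts = s.groupby(s).transform('size')
    return s.where(counts > 1).rank(method='dense')

Applied to the times above with a 2-second window, this should reproduce the Grouped ID column (with blanks for transactions 3, 4 and 10).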

Thanks very much for any insight you can offer!

wrcobb

1 Answer

The solution I suggest requires you to index your data by its Time column. You can then build a list of datetimes at the desired frequency, use searchsorted to find the nearest positions in that index, and use those positions for slicing (as suggested in the questions python pandas dataframe slicing by date conditions and Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval).

I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you can adapt the code; DateOffset goes down to the microsecond level.
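For example, offsets compose with integer multiplication, so the 50 ms spacing from the question can be written directly (the start date and period count here are just illustrative):

import pandas as pd
import pandas.tseries.offsets as pto

# ticks every 50 milliseconds, matching the question's use case
ticks_50ms = pd.date_range("2014/11/03", periods=10, freq=50 * pto.Milli())
# DateOffset also goes one level finer if microseconds are needed
ticks_1us = pd.date_range("2014/11/03", periods=10, freq=pto.Micro())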

Using the following code,

import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np

# Create some test data
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)}, index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))

# Define tick datetimes that delimit the groups
# (integer division so periods stays an int under Python 3)
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5*pto.Milli())
# find nearest indexes matching the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))

# make a dataframe with the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id',])

# set the group ids; searchsorted returned integer positions, so slice
# with iloc (its end-exclusive slicing keeps adjacent groups from overlapping)
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i

# update original dataframe
df['Group id'] = dgroups['Group id']

I was able to obtain this kind of dataframe:

                            value Group id
2014-11-03 00:00:00             0        0
2014-11-03 00:00:00.001000      1        0
2014-11-03 00:00:00.002000      2        0
2014-11-03 00:00:00.003000      3        0
2014-11-03 00:00:00.004000      4        0
2014-11-03 00:00:00.005000      5        1
2014-11-03 00:00:00.006000      6        1
2014-11-03 00:00:00.007000      7        1
2014-11-03 00:00:00.008000      8        1
2014-11-03 00:00:00.009000      9        1
2014-11-03 00:00:00.010000     10        2
2014-11-03 00:00:00.011000     11        2
2014-11-03 00:00:00.012000     12        2
2014-11-03 00:00:00.013000     13        2
2014-11-03 00:00:00.014000     14        2
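Once the ids are in place, a plain groupby lets you work per window; for instance (an illustrative aggregation, not part of the solution itself):

# e.g. sum the values that fall within each window
per_group = df.groupby('Group id')['value'].sum()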
amyrit