Selecting and eliminating rows of a dataframe that have timestamp too close to other rows

Question

I have a dataframe with data from an equipment. Sometimes the start of the equipment fail and will generate a line of data that the action was done. Then the equipment will try to start again a few seconds later and in most cases within 2 or 3 trials it succeeds.

The problem is that the retries and successes all goes to the same table with no distinction between false starts and real starts. As the equipment starts only once every few hours, all I have to do is find all the rows that have a similar timestamp (inside an interval of 2 minutes for example) and keep only the last one.

The task is to eliminate those "false starts" from the dataframe.

The dataframe is ordered by those timestamps so the index of those "false starts" will be a sequence. It can be done for one equipment by iterating and eliminating if:

df.timestamp_local.iloc[i]-df.timestamp_local.iloc[i-1] =< 'some timedelta'

But is impractical to do when running over thousands of equipments.

Input example of the dataframe where the last 3 ones are one case with only the last row as a "real start":

device_name timestamp_local tk_event_desc
0   A005    2019-08-29 19:14:57 Start
1   A005    2019-09-03 09:11:37 Start
2   A005    2019-09-06 14:06:30 Start
3   A005    2019-09-09 17:39:17 Start
4   A005    2019-09-12 10:43:33 Start
5   A005    2019-09-12 17:07:08 Start
6   A005    2019-09-13 01:18:36 Start
7   A005    2019-09-13 13:20:40 Start
8   A005    2019-09-17 17:54:44 Start
9   A005    2019-09-21 12:29:47 Start
10  A005    2019-09-22 11:58:26 Start
11  A005    2019-09-22 11:58:27 Start
12  A005    2019-09-22 11:58:29 Start

Please include [``mre``](https://stackoverflow.com/q/20109391/4985099) — sushanth, Aug 28 '20 at 08:36
Also please include what you have tried, or at least what you think may work. SO only debugs your attempts or suggest alternate approaches if needed, but will not provide code from scratch. — Jeff, Aug 28 '20 at 09:00
I tried to use iteration looking if the previous one are within the range and deleting it, it solves for one equipment, but it is impractical when I need to apply this for thousands — Pedro Silva, Aug 28 '20 at 09:03

score 2 · Accepted Answer · answered Aug 28 '20 at 09:20

This could be done using groupby on the device column and a wanted time interval. For example, using 2 minutes:

# Make sure the column is datetime type.
df['timestamp_local'] = pd.to_datetime(df['timestamp_local'])

# Copy the time column to  keep the actual timestamp values in the result.
df['time_group'] = df['timestamp_local']

freq = '2Min'
df.groupby(['device_name', pd.Grouper(key='time_group', freq=freq)]) \
  .last() \
  .reset_index() \
  .drop(columns=['time_group'])

Result with the provided data:

   device_name      timestamp_local   tk_event_desc
0         A005  2019-08-29 19:14:57           Start
1         A005  2019-09-03 09:11:37           Start
2         A005  2019-09-06 14:06:30           Start
3         A005  2019-09-09 17:39:17           Start
4         A005  2019-09-12 10:43:33           Start
5         A005  2019-09-12 17:07:08           Start
6         A005  2019-09-13 01:18:36           Start
7         A005  2019-09-13 13:20:40           Start
8         A005  2019-09-17 17:54:44           Start
9         A005  2019-09-21 12:29:47           Start
10        A005  2019-09-22 11:58:29           Start

thanks, never thought about groupby, I have to read more into it and use it more — Pedro Silva, Aug 28 '20 at 09:51

Jeff · Answer 2 · 2020-08-28T09:27:56.223

Iterating a python loop over dataframes are highly discouraged. For why see https://stackoverflow.com/a/55557758/8479618. You should almost always use built in operations as they are optimized under-the-hood.

I'm going to assume your timestamps are already converted to datetimes. If not, use pandas.to_datetime().

Let df be your above original dataframe. We first find the difference between each timestamp and shift it upwards.

df['time_diff'] = tf.timestamp_local.diff().shift(-1)

Then we check if the timedelta is below our threshold, and if it is we don't select it with our boolean mask (Note: I imported timedelta with from datetime import timedelta). In this example I set the threshold at 2 minutes, meaning if a start is within 2 minutes of another start I see it as a false start.

df[
    (df['time_diff']>timedelta(minutes=2)) | #Checks for threshold
    (df['time_diff'].isnull()) #handles the last null value
]

np, it bascially shifts the entire column up or down the number of rows you put in. In this case, `shift(-1)` shifts the column up one row. Note: my answer didn't account for different devices as I didn't know whether your devices might have times close to each other. I would suggest trying to understand Shaido's approach as it takes into account unique devices. — Jeff, Aug 28 '20 at 09:54

Selecting and eliminating rows of a dataframe that have timestamp too close to other rows

2 Answers2