
I have data that looks like the following:

Datetime column     Binary column
2020-01-02 08:30:00 True
2020-01-02 08:31:00 False
2020-01-02 08:32:00 False
2020-01-02 08:33:00 False
2020-01-02 08:34:00 True
.
.
.
2020-01-02 08:58:00 True

As you can see, the data always comes in 1-minute intervals. In addition, there is a binary True/False column.

I have a variable gap that specifies the maximum number of consecutive Falses allowed between two Trues. If a run of Falses is no longer than gap, I do nothing; if it is longer, I want to drop all rows in that run. In the example above (the first 5 rows), with gap=3 or more I wouldn't drop any rows; with a smaller gap (1 or 2), I would drop rows 2, 3 and 4.

My current solution uses the between_dates() method: I iterate through a zipped list of all the dates with True and check whether the length of the series of dates in between is smaller than or equal to the gap.

Are you aware of any other approach (preferably vectorized) that could solve this problem without a for loop?
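For reference, a minimal sketch of the loop-based approach described above (the column names, and the use of boolean masks between consecutive True timestamps, are my assumptions; the question refers to between_dates()):

```python
import pandas as pd

# Sample data at 1-minute intervals (column name is an assumption)
idx = pd.date_range("2020-01-02 08:30:00", periods=6, freq="min")
df = pd.DataFrame({"Binary": [True, False, False, False, True, True]}, index=idx)

gap = 2
true_times = df.index[df["Binary"]]
to_drop = []
# Iterate over consecutive pairs of True timestamps
for start, end in zip(true_times, true_times[1:]):
    between = df.index[(df.index > start) & (df.index < end)]
    if len(between) > gap:  # run of Falses longer than the allowed gap
        to_drop.extend(between)
df = df.drop(to_drop)
```

Here the run of three Falses exceeds gap=2, so those three rows are dropped and only the True rows remain.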

FactorizeD

1 Answer


After some trial and error, I figured out a way. I am not sure it is optimal, but it is mostly vectorized. Here is the code:

import pandas as pd
import numpy as np

gap = 3  # You can modify this value
# Create dataframe with True/False sequences
tmp = pd.DataFrame([True, False, True, False, False, True, False, False, False, True, False, False,
                    False, False, False, True], columns=['Binary'])
# Convert to zeros and ones to make computations and filtering
tmp['col_0'] = (~tmp['Binary']).astype(int)
# Count consecutive False in a vectorized way. Check Note 1 for next line
runs = (tmp['col_0'] != tmp['col_0'].shift()).cumsum()
tmp['col_1'] = ((tmp['col_0'] * (tmp['col_0'].groupby(runs).cumcount() + 1)) > gap).astype(int)
# Create NaN in lines we are interested to remove
tmp['col_2'] = tmp['col_1'].replace(1, np.nan)
# Finish creating NaN in lines before we reached the 'gap' value. Check Note 2 for next segment
for counter in range(1, gap + 1):
    tmp['col_2'] = tmp['col_2'] + tmp['col_1'].shift(-counter)
    tmp['col_2'] = tmp['col_2'].replace(1, np.nan)
# shift() creates NaN at the end of the DataFrame, so I need to verify the last
# 'gap' rows are OK. Note 'col_2' is the fourth column, hence the positional lookup. Check Note 3
tmp.iloc[np.where(tmp['col_1'].iloc[len(tmp) - gap:] == 0)[0] + len(tmp) - gap,
         tmp.columns.get_loc('col_2')] = 0
# Drop the NaN lines
tmp.dropna(inplace=True)

Note 1: Check python pandas - creating a column which keeps a running count of consecutive values

Note 2: I asked a related vectorization question here: How to vectorize a function that uses both row and column elements of a dataframe, and @andrej-kesely was kind enough to solve it. That is where I got the idea of using shift(). Maybe this can be vectorized in a better way, but this is what I could figure out so far

Note 3: Check pandas dataframe fails to assign value to slice subset

As you can see, there are several steps, and apart from the small loop over gap, all of them are vectorized.
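For what it's worth, the same run-length idea can also be written without the shift loop, by computing each run's size with groupby/transform and dropping False runs longer than gap (a sketch of an alternative, not the exact method above):

```python
import pandas as pd

gap = 3
s = pd.Series([True, False, True, False, False, True, False, False, False,
               True, False, False, False, False, False, True], name='Binary')

# Label each run of identical consecutive values with an id
run_id = (s != s.shift()).cumsum()
# Size of the run each row belongs to
run_size = s.groupby(run_id).transform('size')
# Keep True rows, and False runs that are no longer than gap
result = s[s | (run_size <= gap)]
```

With gap=3, only the run of five consecutive Falses is removed; the runs of two and three Falses survive.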

If this was useful, I would appreciate an upvote and marking it as the solution

xiaxio