1

In this task I have a data frame with timestamp and a service message that is a request of some service, this column has multiple entries of same message in a short period of time because the message keeps on generating until service is accepted or rejected. here is sample dataframe -

timestamp                   service message
2019-02-15 15:56:19         battery check
2019-02-15 15:56:34         battery check
2019-02-15 15:57:02         battery check
2019-02-15 15:57:30         battery check
2019-02-15 15:57:54         battery check
2019-02-15 15:59:10         battery check
2019-02-15 17:05:03         battery check
2019-02-15 17:05:42         battery check
2019-02-15 17:06:40         no network available
2019-02-15 17:06:48         no network available
2019-02-15 17:06:58         no network available
2019-02-15 17:08:31         no network available

I want to remove all duplicates let's say in a 15 min time frame, so that can be counted as 1 service. but simple string check not working since message at 17:05:03 is different from previous since there is more than an hour difference. Is there a way to do this?

Thanks in advance.

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • 1
    Does this answer your question? [pandas resample documentation](https://stackoverflow.com/questions/17001389/pandas-resample-documentation) – RichieV Sep 03 '20 at 08:20
  • 1
    Another question https://stackoverflow.com/q/47704702/6692898 – RichieV Sep 03 '20 at 08:22

2 Answers2

0

Use Grouper with column service message for aggregate GroupBy.first:

print (df.groupby([pd.Grouper(freq='15min', key='timestamp'), 'service message']).first())
                                                   timestamp
timestamp           service message                         
2019-02-15 15:45:00 battery check        2019-02-15 15:56:19
2019-02-15 17:00:00 battery check        2019-02-15 17:05:03
                    no network available 2019-02-15 17:06:40

df1 = (df.groupby([pd.Grouper(freq='15min', key='timestamp'), 
                 'service message'])
         .first()
         .reset_index(level=0, drop=True)
         .reset_index())
print (df1)
        service message           timestamp
0         battery check 2019-02-15 15:56:19
1         battery check 2019-02-15 17:05:03
2  no network available 2019-02-15 17:06:40
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

We can simply use boolean masking to remove the duplicates:

m1 = df['timestamp'].diff().le(pd.Timedelta(minutes=15))
m2 = df['service message'].duplicated() & \
     df['service message'].eq(df['service message'].shift())

df = df[~(m1 & m2)]

Details:

print(m1)
0     False
1      True
2      True
3      True
4      True
5      True
6     False
7      True
8      True
9      True
10     True
11     True
Name: timestamp, dtype: bool

print(m2)
0     False
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8     False
9      True
10     True
11     True
Name: service message, dtype: bool

Result:

print(df)
            timestamp       service message
0 2019-02-15 15:56:19         battery check
6 2019-02-15 17:05:03         battery check
8 2019-02-15 17:06:40  no network available
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53