Check duplication based on time series (pandas)

Question

I am working on a dataset that clearly contains duplicates, but `df.duplicated()` returns False for every row because the time column is unique. How can I find the duplicates in columns A, B, and C based on the time difference between them? For example, if the time difference is less than 200 ms, delete the duplicates.

sample of my data

Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. — jezrael, Mar 22 '18 at 14:24

score 0 · Answer 1 · answered Mar 22 '18 at 14:43

IIUC, you could do something like this:

import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({'A':np.random.randint(1,3,48),'B':np.random.randint(11,13,48),'C':np.random.randint(101,113,48),'time':pd.date_range('2014-09-10',periods=48,freq='10T')})

df.join(df.groupby(pd.Grouper(key='time', freq='30T'), group_keys=False, as_index=False).apply(lambda x: x.duplicated(['A','B','C'], keep=False)).rename('dups'))

Output:

    A   B    C                time   dups
0   1  11  110 2014-09-10 00:00:00  False
1   2  11  103 2014-09-10 00:10:00  False
2   1  12  105 2014-09-10 00:20:00  False
3   1  12  109 2014-09-10 00:30:00  False
4   1  11  102 2014-09-10 00:40:00  False
5   1  11  103 2014-09-10 00:50:00  False
6   1  12  102 2014-09-10 01:00:00  False
7   2  11  102 2014-09-10 01:10:00  False
8   2  12  104 2014-09-10 01:20:00  False
9   1  11  106 2014-09-10 01:30:00  False
10  2  11  110 2014-09-10 01:40:00  False
11  2  12  101 2014-09-10 01:50:00  False
12  1  11  109 2014-09-10 02:00:00  False
13  2  12  112 2014-09-10 02:10:00  False
14  1  11  102 2014-09-10 02:20:00  False
15  2  12  107 2014-09-10 02:30:00  False
16  1  11  104 2014-09-10 02:40:00  False
17  2  11  104 2014-09-10 02:50:00  False
18  2  11  112 2014-09-10 03:00:00  False
19  1  11  106 2014-09-10 03:10:00  False
20  1  12  110 2014-09-10 03:20:00  False
21  1  11  108 2014-09-10 03:30:00  False
22  2  11  110 2014-09-10 03:40:00  False
23  2  12  103 2014-09-10 03:50:00  False
24  2  12  104 2014-09-10 04:00:00   True
25  1  12  112 2014-09-10 04:10:00  False
26  2  12  104 2014-09-10 04:20:00   True
27  1  11  104 2014-09-10 04:30:00  False
28  1  11  109 2014-09-10 04:40:00  False
29  1  11  107 2014-09-10 04:50:00  False
30  1  11  110 2014-09-10 05:00:00  False
31  2  12  108 2014-09-10 05:10:00  False
32  2  12  107 2014-09-10 05:20:00  False
33  2  11  104 2014-09-10 05:30:00  False
34  1  11  110 2014-09-10 05:40:00  False
35  1  11  107 2014-09-10 05:50:00  False
36  2  11  107 2014-09-10 06:00:00  False
37  1  12  112 2014-09-10 06:10:00  False
38  1  11  107 2014-09-10 06:20:00  False
39  2  12  102 2014-09-10 06:30:00  False
40  1  12  111 2014-09-10 06:40:00  False
41  2  11  104 2014-09-10 06:50:00  False
42  1  12  105 2014-09-10 07:00:00  False
43  2  12  104 2014-09-10 07:10:00  False
44  2  12  102 2014-09-10 07:20:00  False
45  2  11  101 2014-09-10 07:30:00  False
46  1  12  106 2014-09-10 07:40:00  False
47  1  12  109 2014-09-10 07:50:00  False
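Note that fixed 30-minute bins only flag rows that fall into the same bin, so two matching rows on either side of a bin edge would not be marked. If the goal is literally "same A, B, C within 200 ms of the previous occurrence", a minimal sketch using `groupby(...).diff()` instead of `pd.Grouper` might look like the following. The sample frame and the `gap` and `deduped` names are made up for illustration, and the 200 ms threshold is taken from the question.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 repeat (A, B, C) 50 ms apart,
# rows 2 and 3 repeat 500 ms apart, row 4 repeats row 0 several seconds later.
df = pd.DataFrame({
    'A': [1, 1, 2, 2, 1],
    'B': [11, 11, 12, 12, 11],
    'C': [101, 101, 102, 102, 101],
    'time': pd.to_datetime([
        '2018-03-22 14:00:00.000',
        '2018-03-22 14:00:00.050',
        '2018-03-22 14:00:01.000',
        '2018-03-22 14:00:01.500',
        '2018-03-22 14:00:05.000',
    ]),
})

# Sort by time so that diff() measures the gap to the previous occurrence.
df = df.sort_values('time')

# Time elapsed since the previous row with the same (A, B, C); NaT for first occurrences.
gap = df.groupby(['A', 'B', 'C'])['time'].diff()

# A row counts as a duplicate if it follows an identical row by less than 200 ms.
# Comparisons against NaT evaluate to False, so first occurrences are kept.
df['dups'] = gap < pd.Timedelta('200ms')

# Drop the flagged rows.
deduped = df[~df['dups']]
print(deduped)
```

With this rule, a run of identical rows each less than 200 ms after the previous one collapses to its first row, even if the run as a whole spans more than 200 ms. Whether that is the desired behavior depends on the data.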

`30T` means 30 minutes. It is a time offset alias; see the [pandas docs](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) — Scott Boston, Mar 22 '18 at 14:50
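To see what an offset alias resolves to, a small sketch like the one below may help. It uses the `min` spelling for minutes, which recent pandas versions prefer over `T`. The variable names are illustrative only.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Parse an alias string into an offset object.
offset = to_offset('30min')

# Use the offset as a frequency: consecutive stamps are 30 minutes apart.
stamps = pd.date_range('2014-09-10', periods=3, freq=offset)
step = stamps[1] - stamps[0]

# Millisecond aliases work the same way, e.g. for a 200 ms threshold.
threshold = pd.Timedelta('200ms')

print(offset, step, threshold)
```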
