I am working on a dataset that I can see it has duplication but when I use df.duplicates it returns false because of the time column is unique. How can I get the duplication from A,B, C based on time difference of the duplicates? for example, if the time difference is less then 200 ms delete duplicates
Asked
Active
Viewed 45 times
1
-
3Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Mar 22 '18 at 14:24
1 Answers
0
IIUC, you could do something like this:
np.random.seed(123)
df = pd.DataFrame({'A':np.random.randint(1,3,48),'B':np.random.randint(11,13,48),'C':np.random.randint(101,113,48),'time':pd.date_range('2014-09-10',periods=48,freq='10T')})
df.join(df.groupby(pd.Grouper(key='time', freq='30T'), group_keys=False, as_index=False).apply(lambda x: x.duplicated(['A','B','C'], keep=False)).rename('dups'))
Output:
A B C time dups
0 1 11 110 2014-09-10 00:00:00 False
1 2 11 103 2014-09-10 00:10:00 False
2 1 12 105 2014-09-10 00:20:00 False
3 1 12 109 2014-09-10 00:30:00 False
4 1 11 102 2014-09-10 00:40:00 False
5 1 11 103 2014-09-10 00:50:00 False
6 1 12 102 2014-09-10 01:00:00 False
7 2 11 102 2014-09-10 01:10:00 False
8 2 12 104 2014-09-10 01:20:00 False
9 1 11 106 2014-09-10 01:30:00 False
10 2 11 110 2014-09-10 01:40:00 False
11 2 12 101 2014-09-10 01:50:00 False
12 1 11 109 2014-09-10 02:00:00 False
13 2 12 112 2014-09-10 02:10:00 False
14 1 11 102 2014-09-10 02:20:00 False
15 2 12 107 2014-09-10 02:30:00 False
16 1 11 104 2014-09-10 02:40:00 False
17 2 11 104 2014-09-10 02:50:00 False
18 2 11 112 2014-09-10 03:00:00 False
19 1 11 106 2014-09-10 03:10:00 False
20 1 12 110 2014-09-10 03:20:00 False
21 1 11 108 2014-09-10 03:30:00 False
22 2 11 110 2014-09-10 03:40:00 False
23 2 12 103 2014-09-10 03:50:00 False
24 2 12 104 2014-09-10 04:00:00 True
25 1 12 112 2014-09-10 04:10:00 False
26 2 12 104 2014-09-10 04:20:00 True
27 1 11 104 2014-09-10 04:30:00 False
28 1 11 109 2014-09-10 04:40:00 False
29 1 11 107 2014-09-10 04:50:00 False
30 1 11 110 2014-09-10 05:00:00 False
31 2 12 108 2014-09-10 05:10:00 False
32 2 12 107 2014-09-10 05:20:00 False
33 2 11 104 2014-09-10 05:30:00 False
34 1 11 110 2014-09-10 05:40:00 False
35 1 11 107 2014-09-10 05:50:00 False
36 2 11 107 2014-09-10 06:00:00 False
37 1 12 112 2014-09-10 06:10:00 False
38 1 11 107 2014-09-10 06:20:00 False
39 2 12 102 2014-09-10 06:30:00 False
40 1 12 111 2014-09-10 06:40:00 False
41 2 11 104 2014-09-10 06:50:00 False
42 1 12 105 2014-09-10 07:00:00 False
43 2 12 104 2014-09-10 07:10:00 False
44 2 12 102 2014-09-10 07:20:00 False
45 2 11 101 2014-09-10 07:30:00 False
46 1 12 106 2014-09-10 07:40:00 False
47 1 12 109 2014-09-10 07:50:00 False

Scott Boston
- 147,308
- 15
- 139
- 187
-
-
30 minutes. it is a time offset alias see [pandas docs](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) – Scott Boston Mar 22 '18 at 14:50