I have a table with 3 columns: date, time, and event. I want to remove duplicate rows that have the same date, the same event, and times within +/- 30 seconds of each other on that date. I have attached a picture of the table and the desired output table. As you can see, the last row is missing in the output table. Rows 1 and 3 have the same date and times within 30 seconds of each other, but their events differ, so row 3 is not deleted. On the other hand, rows 6 and 10 have the same date, times within 30 seconds of each other (we should consider both +30 and -30 seconds), and the same event, so the duplicate is deleted and only the first of the duplicate rows is kept in the file. Kindly advise. My idea was to group by date and then compare times and events.
- I think it makes more sense to use bins for the grouping. E.g. if you have 3 rows, all within the same minute but differing in seconds at the 10th, 30th and 50th: based on your logic, the first and second rows will be grouped, but the 2nd and 3rd rows could be grouped too. Maybe consider bins of 1-min frequency? – kerwei Mar 07 '19 at 01:56
- I tried your idea, but it is becoming a bit complex and is not giving the correct answer. If you are able to get it working, please let me know – sandy Mar 07 '19 at 02:51
- My advice is to concatenate the date and time columns into a datetime, `dt = datetime.datetime.combine(date, time)`, and then get its timestamp, which is a seconds count (an int value): `timestamp = dt.timestamp()`. After this, you can simply iterate over the rows and compare their timestamps as plain ints. – Platon Mar 07 '19 at 06:43
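A minimal sketch of that timestamp-comparison idea, done with pandas rather than `datetime.combine` (the sample data and the keep-the-first-occurrence rule are taken from the question; the column names `Date`, `Time`, `Event` are assumed to match the table):

```python
import pandas as pd

df = pd.DataFrame({
    'Date':  ['12/17/2018', '12/17/2018', '12/17/2018', '12/17/2018'],
    'Time':  ['2:29:10 AM', '2:29:30 AM', '2:29:35 AM', '2:31:00 AM'],
    'Event': ['iti', 'esz', 'iti', 'iti'],
})

# Combine date and time into one timestamp, expressed as whole seconds
ts = pd.to_datetime(df['Date'] + ' ' + df['Time']).astype('int64') // 10**9

# Keep a row only if no previously kept row has the same event
# within +/- 30 seconds; otherwise drop it as a duplicate
keep = []
seen = []  # (timestamp, event) pairs of rows kept so far
for t, ev in zip(ts, df['Event']):
    dup = any(abs(t - t0) <= 30 and ev == ev0 for t0, ev0 in seen)
    keep.append(not dup)
    if not dup:
        seen.append((t, ev))

deduped = df[keep]
```

Here the `iti` row at 2:29:35 AM is dropped (25 seconds after the first `iti`), while the `iti` row at 2:31:00 AM survives because it is more than 30 seconds away from any kept occurrence.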
1 Answer
Here's one possible solution, though I agree that it got messy as I was working on it. Depending on the range of your dates, it may take quite a while to run during the resampling into 1-minute bins.
```python
import pandas as pd

df = pd.read_csv('my_sample_data.csv')

>>> df
         Date        Time Event
0  12/15/2018  6:55:41 AM   abc
1  12/15/2018  1:36:39 PM   def
2  12/15/2018  2:21:56 PM   com
3  12/16/2018  6:00:11 PM   pil
4  12/16/2018  8:22:20 PM   ati
5  12/17/2018  2:29:10 AM   iti
6  12/17/2018  2:29:30 AM   esz
7  12/17/2018  2:29:50 AM   iti
8  12/17/2018  9:04:03 AM   ono
9  12/17/2018  9:35:04 AM   fac
```
```python
# Create a DateTime column by concatenating the Date and Time strings
df1 = pd.DataFrame([' '.join([x, y]) for x, y in zip(df['Date'], df['Time'])])
df1.columns = ['DateTime']
df1['DateTime'] = pd.to_datetime(df1['DateTime'])

# Join it back to the original dataframe and set it as the index
df = df.join(df1)
df.set_index('DateTime', inplace=True)

# Bin the data into 1-minute intervals; collect the unique events in each bin
outdf = df.resample('1Min')['Event'].unique().to_frame()
outdf.columns = ['Events']

# Drop the empty bins that resampling introduces
outdf = outdf.loc[outdf['Events'].apply(lambda x: len(x) > 0)]
```
```python
>>> outdf
                         Events
DateTime
2018-12-15 06:55:00       [abc]
2018-12-15 13:36:00       [def]
2018-12-15 14:21:00       [com]
2018-12-16 18:00:00       [pil]
2018-12-16 20:22:00       [ati]
2018-12-17 02:29:00  [iti, esz]
2018-12-17 09:04:00       [ono]
2018-12-17 09:35:00       [fac]
```
```python
# Explode each list into separate rows. Solution provided by @Zero at
# https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows
>>> outdf['Events'].apply(pd.Series).stack().reset_index(level=2, drop=True).to_frame('Events')
                    Events
DateTime
2018-12-15 06:55:00    abc
2018-12-15 13:36:00    def
2018-12-15 14:21:00    com
2018-12-16 18:00:00    pil
2018-12-16 20:22:00    ati
2018-12-17 02:29:00    iti
2018-12-17 02:29:00    esz
2018-12-17 09:04:00    ono
2018-12-17 09:35:00    fac
```
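As a side note, pandas 0.25 and later provide a built-in `Series.explode`, which replaces the `apply(pd.Series).stack()` idiom above. A small sketch on a hand-built frame shaped like `outdf` (the sample rows are assumptions for illustration):

```python
import pandas as pd

# A frame shaped like outdf above: one list of events per 1-minute bin
outdf = pd.DataFrame(
    {'Events': [['iti', 'esz'], ['ono']]},
    index=pd.to_datetime(['2018-12-17 02:29:00', '2018-12-17 09:04:00']),
)
outdf.index.name = 'DateTime'

# Series.explode (pandas >= 0.25) expands each list into its own row,
# repeating the index value for every element of the list
exploded = outdf['Events'].explode().to_frame('Events')
```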

kerwei
- Thanks, but this is not the desired output I am looking for. Your input has no common event names. I want to remove the duplicate rows which have the same date, the same event, and times within +/- 30 seconds of each other on the same day – sandy Mar 07 '19 at 03:41
- @sandy I updated my answer to include duplicate events on 17 Dec 2018 between 2:29:00 AM and 2:30:00 AM – kerwei Mar 07 '19 at 03:59
- In your output, iti and esz are clubbed into one row. They should be separate rows – sandy Mar 07 '19 at 06:11
- Also, since you are resampling the datetime, the timestamps in the output are changed, which is incorrect – sandy Mar 07 '19 at 06:13
- @sandy What would your expected timestamp be? The method presented above is probably the standard way to establish 1-minute interval bins, i.e. chopping an hour into 60 minutes, with each tick denoting the lower boundary of its bin. In this case, the events at 02:29:10, 02:29:30 and 02:29:50 are binned into 02:29:00. Alternatively, you may see this as 02:29:30 +-30s. If you center the intervals on event timestamps, then you run into the issue I described above: do you group the 10th second with the 30th, or the 50th with the 30th? – kerwei Mar 07 '19 at 06:27
- I group the 10th with the 30th second. I have changed my code now and am getting the desired result. Thanks a lot!! – sandy Mar 07 '19 at 06:58
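sandy's final code is not shown, but one hedged sketch of that approach, comparing each row against the previous occurrence of the same event and dropping it when the gap is 30 seconds or less, might look like this (the exact logic sandy used is an assumption; the sample rows mirror the answer's data, with a genuine sub-30-second duplicate added):

```python
import pandas as pd

df = pd.DataFrame({
    'DateTime': pd.to_datetime([
        '2018-12-17 02:29:10', '2018-12-17 02:29:30',
        '2018-12-17 02:29:35', '2018-12-17 09:04:03',
    ]),
    'Event': ['iti', 'esz', 'iti', 'ono'],
})

# Within each event, compute the gap to the previous occurrence;
# a row is a duplicate if it follows the same event by <= 30 seconds
df = df.sort_values(['Event', 'DateTime'])
gap = df.groupby('Event')['DateTime'].diff()
deduped = df[gap.isna() | (gap > pd.Timedelta(seconds=30))].sort_values('DateTime')
```

The second `iti` (25 seconds after the first) is dropped, while `esz`, a different event in the same window, is kept.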
- @sandy From the output above, you can derive the date and time columns respectively by using `df['Date'] = df.index.date` and `df['Time'] = df.index.time` – kerwei Mar 11 '19 at 03:30
- @kerwei Thanks. What if I have another two columns in the initial table, called email address and organization id? I also want to add them back into the final result. While resampling, I want a unique combination of event, email address and organization id, and I want all of them in the final result. What would be the best way to do so? – sandy Mar 11 '19 at 03:41
- @sandy As much as possible, try to post the entirety of your requirements together with your original question. Often, the ways to achieve the results can be quite different, and if you request answers piece by piece, you will most likely end up with sub-optimal answers. – kerwei Mar 12 '19 at 02:42
- For the method proposed in my answer, I suppose you will have to create one series for each of your columns and then zip all three together to form a dataframe, something like `pd.DataFrame([(x,y,z) for x,y,z in zip(ser1,ser2,ser3)])`. Thereafter, the steps remain the same: clean up the empty rows and add back the datetime index – kerwei Mar 12 '19 at 02:44
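An alternative sketch of that multi-column variant, which sidesteps the zip-and-resample route entirely: floor each timestamp to its 1-minute bin and use `drop_duplicates` on the combination of bin, event, email and organization id, so every column survives untouched. The `Email` and `OrgId` columns and their values are hypothetical, standing in for the columns sandy described:

```python
import pandas as pd

df = pd.DataFrame({
    'DateTime': pd.to_datetime([
        '2018-12-17 02:29:10', '2018-12-17 02:29:30', '2018-12-17 02:29:50',
    ]),
    'Event': ['iti', 'esz', 'iti'],
    'Email': ['a@x.com', 'b@x.com', 'a@x.com'],  # hypothetical column
    'OrgId': [1, 2, 1],                          # hypothetical column
})

# Bin each timestamp to the minute, then keep the first row of every
# unique (bin, Event, Email, OrgId) combination; drop the helper column
df['Bin'] = df['DateTime'].dt.floor('1min')
deduped = (df.drop_duplicates(subset=['Bin', 'Event', 'Email', 'OrgId'])
             .drop(columns='Bin'))
```

Because `drop_duplicates` keeps the first occurrence by default, the original timestamps are preserved in the output, which addresses the earlier complaint that resampling changes them.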