I have some time-series data with multiple columns:
        date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2021-08-25     desk       IL  opera  type-s    100
5 2022-09-16     desk       IL  opera  type-s     30
6 2022-09-16     desk       NY  opera  type-s      0
7 2022-11-01     desk       NJ  opera  type-s      0
8 2022-11-01     desk       IL  opera  type-s      0
I want to remove consecutive duplicates ignoring the date, e.g.
...
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2021-08-25     desk       IL  opera  type-s    100
...
Rows with index 2 and 4 are duplicates once the date is ignored (same hardware, location, group, rtype and value), and I want to keep only the first one.
I tried two approaches:
- using drop_duplicates, which also drops non-consecutive duplicates, and
- using shift() !=, which I found in this answer and which also seems to do the same thing (sketched just below this list).
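If I understand that answer correctly, the pattern compares each row with the one directly above it; on the value column alone it would be:

df = df.loc[df['value'].shift() != df['value']]

The first row always survives because shift() produces NaN there, which never compares equal.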
Here is my test code:
import pandas as pd
from io import StringIO

data = """
date,hardware,location,group,rtype,value
2021-03-01,desk,NY,opera,type-s,0
2021-03-01,desk,NJ,opera,type-s,200
2021-03-01,desk,IL,opera,type-s,100
2022-08-01,desk,NY,opera,type-s,275
2021-08-25,desk,IL,opera,type-s,100
2022-09-16,desk,IL,opera,type-s,30
2022-09-16,desk,NY,opera,type-s,0
2022-11-01,desk,NJ,opera,type-s,0
2022-11-01,desk,IL,opera,type-s,0
"""
df = pd.read_csv(StringIO(data), parse_dates=['date'])
print('\nOriginal \n', df.to_string())

# Identifying columns; a row is a duplicate when these plus 'value' repeat.
columns = ['hardware', 'location', 'group', 'rtype']
dedup_cols = columns + ['value']

# Approach 1: drop_duplicates.
df = df.drop_duplicates(subset=dedup_cols, keep='first').reset_index(drop=True)

# Approach 2: keep rows where any dedup column differs from the previous row
# (note this runs on the frame already filtered by drop_duplicates above).
de_dup = df.loc[(df[dedup_cols].shift() != df[dedup_cols]).any(axis=1)]

print('\nDrop_duplicates: \n', df.to_string())
print('\nShift_filter: \n', de_dup.to_string())
# The result I want to end up with:
expected = """
date,hardware,location,group,rtype,value
2021-03-01,desk,NY,opera,type-s,0
2021-03-01,desk,NJ,opera,type-s,200
2021-03-01,desk,IL,opera,type-s,100
2022-08-01,desk,NY,opera,type-s,275
2022-09-16,desk,IL,opera,type-s,30
2022-09-16,desk,NY,opera,type-s,0
2022-11-01,desk,NJ,opera,type-s,0
2022-11-01,desk,IL,opera,type-s,0
"""
expected_df = pd.read_csv(StringIO(expected), parse_dates=['date'])
print('Equal drop_duplicates ? ', expected_df.equals(df))
print('Equal shift_filter ? ', expected_df.equals(de_dup))
print('Same result from the two techniques? ', de_dup.equals(df))
print('\nExpected:\n', expected_df.to_string())
OUTPUT:
Original
        date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2021-08-25     desk       IL  opera  type-s    100
5 2022-09-16     desk       IL  opera  type-s     30
6 2022-09-16     desk       NY  opera  type-s      0
7 2022-11-01     desk       NJ  opera  type-s      0
8 2022-11-01     desk       IL  opera  type-s      0

Drop_duplicates:
        date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2022-09-16     desk       IL  opera  type-s     30
5 2022-11-01     desk       NJ  opera  type-s      0
6 2022-11-01     desk       IL  opera  type-s      0

Shift_filter:
        date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2022-09-16     desk       IL  opera  type-s     30
5 2022-11-01     desk       NJ  opera  type-s      0
6 2022-11-01     desk       IL  opera  type-s      0
Equal drop_duplicates ? False
Equal shift_filter ? False
Same result from the two techniques? True
Expected:
        date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2022-09-16     desk       IL  opera  type-s     30
5 2022-09-16     desk       NY  opera  type-s      0
6 2022-11-01     desk       NJ  opera  type-s      0
7 2022-11-01     desk       IL  opera  type-s      0
I reckon the drop_duplicates method is not suitable. I was hoping the second method would work, but I don't completely understand why it doesn't. I am thinking of some approach with groupby + shift: take a consecutive diff within each group and drop the rows where the diff is 0, roughly as sketched below. Any ideas?
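Something like this is what I have in mind (an untested sketch against the original df, before any of the filtering above; key_cols is just my name for the identifying columns):

key_cols = ['hardware', 'location', 'group', 'rtype']
# Shift 'value' within each (hardware, location, group, rtype) group and
# keep a row only when its value differs from the previous value in the
# same group. The first row of each group survives because the shifted
# value there is NaN, which never compares equal.
mask = df['value'].ne(df.groupby(key_cols)['value'].shift())
result = df.loc[mask].reset_index(drop=True)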