
I have some time-series data with multiple columns:

         date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2021-08-25     desk       IL  opera  type-s    100
5 2022-09-16     desk       IL  opera  type-s     30
6 2022-09-16     desk       NY  opera  type-s      0
7 2022-11-01     desk       NJ  opera  type-s      0
8 2022-11-01     desk       IL  opera  type-s      0

I want to remove consecutive duplicates ignoring the date, e.g.

...
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2021-08-25     desk       IL  opera  type-s    100
...

Rows with index 2 and 4 are duplicates (ignoring the date), and I want to keep only the first one.

I tried two approaches:

  1. using drop_duplicates, which also drops non-consecutive duplicates, and
  2. using the shift() != comparison that I found in this answer, which also seems to do the same thing.

Here is my test code:

import pandas as pd
from io import StringIO

data = """
date,hardware,location,group,rtype,value
2021-03-01,desk,NY,opera,type-s,0
2021-03-01,desk,NJ,opera,type-s,200
2021-03-01,desk,IL,opera,type-s,100
2022-08-01,desk,NY,opera,type-s,275
2021-08-25,desk,IL,opera,type-s,100
2022-09-16,desk,IL,opera,type-s,30
2022-09-16,desk,NY,opera,type-s,0
2022-11-01,desk,NJ,opera,type-s,0
2022-11-01,desk,IL,opera,type-s,0
"""

df = pd.read_csv(StringIO(data), parse_dates=['date'])

print('\nOriginal \n', df.to_string())

columns = ['hardware', 'location', 'group', 'rtype']


dedup_cols = columns + ['value']

df = df.drop_duplicates(
    subset=dedup_cols, keep='first').reset_index(drop=True)

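# note: df was reassigned by drop_duplicates above, so this shift filter
# runs on the already de-duplicated frame rather than the original data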
de_dup = df.loc[(df[dedup_cols].shift() != df[dedup_cols]).any(axis=1)]

print('\nDrop_duplicates: \n', df.to_string())
print('\nShift_filter: \n', de_dup.to_string())

expected = """
date,hardware,location,group,rtype,value
2021-03-01,desk,NY,opera,type-s,0
2021-03-01,desk,NJ,opera,type-s,200
2021-03-01,desk,IL,opera,type-s,100
2022-08-01,desk,NY,opera,type-s,275
2022-09-16,desk,IL,opera,type-s,30
2022-09-16,desk,NY,opera,type-s,0
2022-11-01,desk,NJ,opera,type-s,0
2022-11-01,desk,IL,opera,type-s,0
"""

expected_df = pd.read_csv(StringIO(expected), parse_dates=['date'])

print('Equal drop_duplicates ? ', expected_df.equals(df))
print('Equal shift_filter ? ', expected_df.equals(de_dup))
print('Same result from the two techniques? ', de_dup.equals(df))
print('\nExpected:\n', expected_df.to_string())

OUTPUT:

Original 
         date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2021-08-25     desk       IL  opera  type-s    100
5 2022-09-16     desk       IL  opera  type-s     30
6 2022-09-16     desk       NY  opera  type-s      0
7 2022-11-01     desk       NJ  opera  type-s      0
8 2022-11-01     desk       IL  opera  type-s      0

Drop_duplicates: 
         date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2022-09-16     desk       IL  opera  type-s     30
5 2022-11-01     desk       NJ  opera  type-s      0
6 2022-11-01     desk       IL  opera  type-s      0

Shift_filter: 
         date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2022-09-16     desk       IL  opera  type-s     30
5 2022-11-01     desk       NJ  opera  type-s      0
6 2022-11-01     desk       IL  opera  type-s      0
Equal drop_duplicates ?  False
Equal shift_filter ?  False
Same result from the two techniques?  True

Expected:
         date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2022-09-16     desk       IL  opera  type-s     30
5 2022-09-16     desk       NY  opera  type-s      0
6 2022-11-01     desk       NJ  opera  type-s      0
7 2022-11-01     desk       IL  opera  type-s      0

I reckon the drop_duplicates method is not suitable.

I was hoping the second method would work but I don't completely understand why it doesn't.

I am thinking of some approach with groupby + shift: take a consecutive diff per group and drop rows where the diff is 0. Any ideas?
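Something along these lines is what I have in mind, as a rough, untested sketch (it assumes value stays numeric so that diff() applies):

group_cols = ['hardware', 'location', 'group', 'rtype']

# per-group difference between consecutive 'value' entries;
# the first row of each group comes out as NaN
diff = df.groupby(group_cols)['value'].diff()

# NaN != 0 is True, so the first row of each group is kept,
# while rows that repeat the previous value within their group are dropped
out = df[diff.ne(0)]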

Wajahat

2 Answers


I imagine you want to consider consecutive values per group.

Then use groupby.shift and boolean indexing:

cols = ['hardware', 'location', 'group', 'rtype']
out = df.loc[df['value'].ne(df.groupby(cols)['value'].shift())]

Output:

        date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
5 2022-09-16     desk       IL  opera  type-s     30
6 2022-09-16     desk       NY  opera  type-s      0
7 2022-11-01     desk       NJ  opera  type-s      0
8 2022-11-01     desk       IL  opera  type-s      0
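If the dates must be increasing (see the comments below), one option is an explicit sort first. A minimal sketch, where the sort itself is an addition and not part of the answer as posted:

cols = ['hardware', 'location', 'group', 'rtype']

# sort chronologically so that "previous value" means the value at the previous date
df = df.sort_values('date')

# keep a row only when its value differs from the previous value within its group
out = df.loc[df['value'].ne(df.groupby(cols)['value'].shift())]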
mozway
  • When I execute this code, it seems to be producing the same result as the other two. – Wajahat Jun 10 '23 at 07:53
  • You can't post formatted code as a comment. But please check carefully: the output I provided matches your expected output (only row 4 is dropped). – mozway Jun 10 '23 at 07:54
  • Oh yes you are right, I was using the incorrect dataframe ::facepalm:: – Wajahat Jun 10 '23 at 07:57
  • My original code has a bug, which is also why the two approaches give the same output... – Wajahat Jun 10 '23 at 08:02
  • No worries. NB: if the dates must be increasing, ensure the frame is sorted by date prior to running the code ;) – mozway Jun 10 '23 at 08:03
import pandas as pd
from io import StringIO

data = """
date,hardware,location,group,rtype,value
2021-03-01,desk,NY,opera,type-s,0
2022-09-16,desk,NY,opera,type-s,0
2021-03-01,desk,NJ,opera,type-s,200
2021-03-01,desk,IL,opera,type-s,100
2022-08-01,desk,NY,opera,type-s,275
2021-08-25,desk,IL,opera,type-s,100
2022-09-16,desk,IL,opera,type-s,30
2022-09-16,desk,NY,opera,type-s,0
2022-11-01,desk,NJ,opera,type-s,0
2022-11-01,desk,IL,opera,type-s,0
"""

df = pd.read_csv(StringIO(data), parse_dates=['date'])
dup_cols = ['hardware', 'location', 'group', 'rtype', 'value']

mask = (df[dup_cols] == df[dup_cols].shift(1)).all(axis=1)

df = df[~mask].reset_index(drop=True)

print(df)

Try this one. I hope it works.
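Note that this mask only compares each row with the row directly above it in the frame, so duplicates that are consecutive within a group but interleaved with other groups' rows are still kept. A sketch of one workaround, under the assumption that rows may be reordered, is to sort by the group columns and date before applying the mask:

dup_cols = ['hardware', 'location', 'group', 'rtype', 'value']

# put rows of the same group next to each other, in date order,
# so that "previous row" means the previous date within the group
df_sorted = df.sort_values(['hardware', 'location', 'group', 'rtype', 'date'])

mask = (df_sorted[dup_cols] == df_sorted[dup_cols].shift(1)).all(axis=1)
out = df_sorted[~mask].sort_values('date').reset_index(drop=True)
print(out)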

Venkat