
This is my DataFrame:

                 dt    value
2000-01-01 00:00:00    296.0
2000-01-01 00:05:00    296.0
2000-01-01 00:10:00    296.0
2000-01-01 00:15:00    296.25
2000-01-01 00:20:00    296.25
2000-01-01 00:25:00    296.25
2000-01-01 00:30:00    296.25
2000-01-01 00:35:00    296.25
2000-01-01 00:40:00    296.25
2000-01-01 00:45:00    296.5
2000-01-01 00:50:00    296.5
2000-01-01 00:55:00    296.5
2000-01-01 01:00:00    296.5
2000-01-01 01:05:00    296.5
2000-01-01 01:10:00    296.5
2000-01-01 01:15:00    296.75
2000-01-01 01:20:00    296.75
2000-01-01 01:50:00    297.0
2000-01-01 01:55:00    297.0
2000-01-01 02:00:00    297.0
2000-01-01 02:05:00    297.0
2000-01-01 02:10:00    297.0
2000-01-01 02:15:00    297.0
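
In case it helps to reproduce, the frame above can be rebuilt like this (a quick sketch of my own; `dt` is assumed to be a datetime column rather than the index):

import pandas as pd

df = pd.DataFrame({
    'dt': pd.to_datetime([
        '2000-01-01 00:00:00', '2000-01-01 00:05:00', '2000-01-01 00:10:00',
        '2000-01-01 00:15:00', '2000-01-01 00:20:00', '2000-01-01 00:25:00',
        '2000-01-01 00:30:00', '2000-01-01 00:35:00', '2000-01-01 00:40:00',
        '2000-01-01 00:45:00', '2000-01-01 00:50:00', '2000-01-01 00:55:00',
        '2000-01-01 01:00:00', '2000-01-01 01:05:00', '2000-01-01 01:10:00',
        '2000-01-01 01:15:00', '2000-01-01 01:20:00', '2000-01-01 01:50:00',
        '2000-01-01 01:55:00', '2000-01-01 02:00:00', '2000-01-01 02:05:00',
        '2000-01-01 02:10:00', '2000-01-01 02:15:00']),
    'value': [296.0] * 3 + [296.25] * 6 + [296.5] * 6 + [296.75] * 2 + [297.0] * 6,
})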

I want to remove adjacent duplicates.
Within each run of adjacent duplicates, only the middle row should remain. If the run has an even number of rows, take the one just past the middle; for example, if there are 2 duplicates, take the second.

Expected output:

                 dt    value
2000-01-01 00:05:00    296.0
2000-01-01 00:30:00    296.25
2000-01-01 01:00:00    296.5
2000-01-01 01:20:00    296.75
2000-01-01 02:05:00    297.0

I read this post about duplicates, but it doesn't satisfy my condition of choosing the middle element:
Pandas: Drop consecutive duplicates

Igor K.

3 Answers


Use:

df.groupby(['value'])['dt'].apply(lambda x: x.iloc[math.floor(len(x)/2)])

Output

value
296.00    2000-01-01 00:05:00
296.25    2000-01-01 00:30:00
296.50    2000-01-01 01:00:00
296.75    2000-01-01 01:20:00
297.00    2000-01-01 02:05:00
Name: dt, dtype: object

Needless to say, you have to import math
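
If you want the result back as a two-column DataFrame in the same shape as the expected output, a small follow-up sketch (my addition, not part of the original answer) is to reset the index. Note that this approach groups on the value itself rather than on consecutive runs, so it assumes each value occurs in only one consecutive block, which holds for the sample data:

import math

# Series of middle timestamps, indexed by value
mid = df.groupby(['value'])['dt'].apply(lambda x: x.iloc[math.floor(len(x) / 2)])

# back to a two-column frame matching the expected output
result = mid.reset_index()[['dt', 'value']]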

Timings

@ALollz's answer:

3.82 ms ± 442 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This answer:

2.09 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
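
For reference, here is a rough, non-authoritative sketch of how timings of this kind can be reproduced; scaling the frame up with `df.sample(100000, replace=True)` follows what is mentioned in the comments under the next answer:

import math

# scale the sample frame up so the timings are not dominated by fixed overhead
big = df.sample(100000, replace=True).sort_values('dt').reset_index(drop=True)

def middle_by_value(frame):
    # this answer's approach: group on the value itself
    return frame.groupby(['value'])['dt'].apply(lambda x: x.iloc[math.floor(len(x) / 2)])

def middle_by_run(frame):
    # the built-in approach from the answer below: group on runs of equal values
    gp = frame['value'].ne(frame['value'].shift()).cumsum()
    gp = gp.groupby(gp)
    return frame[gp.transform('size') // 2 == gp.cumcount()]

# in IPython / Jupyter:
# %timeit middle_by_value(big)
# %timeit middle_by_run(big)
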
Vivek Kalyanarangan

To avoid the slower groupby + apply(lambda ...) we can instead use built-in groupby methods implemented in Cython. Form groups of consecutive values by checking against the shifted Series, then aggregate those groups with size and cumcount to subset the original DataFrame. (Bonus: this preserves the original DataFrame index.)

# label each run of consecutive equal values, then group by that label
gp = df['value'].ne(df['value'].shift()).cumsum()
gp = gp.groupby(gp)

# keep the row sitting just past the middle of each run
df[gp.transform('size')//2 == gp.cumcount()]

                     dt   value
1   2000-01-01 00:05:00  296.00
6   2000-01-01 00:30:00  296.25
12  2000-01-01 01:00:00  296.50
16  2000-01-01 01:20:00  296.75
20  2000-01-01 02:05:00  297.00
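
If it helps to see what the intermediate pieces look like, here is a small illustrative sketch of my own (the `inspect` frame and its column names are just for demonstration), assuming `df` is the sample frame from the question:

import pandas as pd

# label each run of consecutive equal values: 1, 1, 1, 2, 2, ...
run_id = df['value'].ne(df['value'].shift()).cumsum()
g = run_id.groupby(run_id)

inspect = pd.DataFrame({
    'value': df['value'],
    'run_id': run_id,
    'run_size': g.transform('size'),  # length of the run, broadcast to every row
    'pos_in_run': g.cumcount(),       # 0-based position of the row within its run
})

# a row survives the mask above exactly when run_size // 2 == pos_in_run
print(inspect.head(10))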

To illustrate the relative performance, here's an example. I create a test DataFrame where each value is repeated twice (just for simplicity), so we can see how the two methods scale as a function of the number of groups, which is the quantity that matters for the scaling of groupby + apply.

import perfplot
import pandas as pd
import numpy as np

def groupby_apply(df):
    gp = df['value'].ne(df['value'].shift()).cumsum()
    return df.groupby(gp).apply(lambda x: x.iloc[len(x) // 2])
    
def groupby_builtin(df):
    gp = df['value'].ne(df['value'].shift()).cumsum()
    gp = gp.groupby(gp)

    return df[gp.transform('size')//2 == gp.cumcount()]

perfplot.show(
    setup=lambda N: pd.DataFrame({'dt': range(N), 
                                  'value': np.repeat(np.arange(N//2), 2)}),
                                  
    kernels=[
        lambda df: groupby_apply(df),
        lambda df: groupby_builtin(df)],
    
    labels=['apply', 'built-in'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,  
    xlabel='Proxy for # of Groups'
)

[perfplot timing chart comparing the 'apply' and 'built-in' kernels as the number of groups grows]

ALollz
  • Although I agree your solution looks faster, timing them gives different results... I am editing my answer to post the timings... can you check from your side as well? – Vivek Kalyanarangan Jan 05 '21 at 18:27
  • That was my expectation too. I did try with `df.sample(100000, replace=True)` to scale it up... got the same type of results. – Vivek Kalyanarangan Jan 05 '21 at 18:38
  • @VivekKalyanarangan I think that's because in your case where you `sample`, you were grouping based on just `value`, so when you sample to more rows you never change the number of groups (i.e. you'll still only ever have 5 unique values, so 5 groups). The number of groups, not rows, is what matters for the scaling of groupby + apply. The code above creates many more groups. – ALollz Jan 05 '21 at 18:48

If you have blocks of values that may be repeated, and you want to get the middle of each block:

# label each block of consecutive equal values
df['g'] = (df['value'] != df['value'].shift(1)).cumsum()
# take the (upper) middle row of each block
df.groupby('g').apply(lambda x: x.iloc[len(x) // 2])

Output:

                    dt   value  g
g                                
1  2000-01-01 00:05:00  296.00  1
2  2000-01-01 00:30:00  296.25  2
3  2000-01-01 01:00:00  296.50  3
4  2000-01-01 01:20:00  296.75  4
5  2000-01-01 02:05:00  297.00  5
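
To make the result line up exactly with the expected output (no helper column, plain integer index), one possible follow-up, sketched here by me rather than taken from the original answer, is shown below; newer pandas versions may also emit a deprecation warning about `apply` touching the grouping column:

# label blocks of consecutive equal values, as in the answer above
df['g'] = (df['value'] != df['value'].shift(1)).cumsum()

out = (df.groupby('g')
         .apply(lambda x: x.iloc[len(x) // 2])  # upper-middle row of each block
         .drop(columns='g')                     # drop the helper column
         .reset_index(drop=True))               # drop the group index
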
perl