
Building off the question/solution here, I'm trying to set a parameter so that consecutive duplicates are removed only if the same value occurs 5 (or more) times in a row...

I'm able to apply the solution in the linked post, which uses .shift() to check whether the previous value (or a value further back or ahead, by adjusting the shift periods parameter) equals the current one, but how could I adjust this to check several consecutive values at once?
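For reference, the shift-based filter from that post looks roughly like this (a minimal sketch, using the y column from my example below):

# drop a row whenever its y equals the previous row's y
df[df['y'] != df['y'].shift()]

That removes every immediate repeat regardless of run length, which is what I want to change.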

Suppose a dataframe that looks like this:

x    y

1    2
2    2
3    3
4    3
5    3
6    3
7    3
8    4
9    4
10   4
11   4
12   2

I'm trying to achieve this:

x    y

1    2
2    2
3    3
8    4
9    4
10   4
11   4
12   2

Where we lose rows 4, 5, 6, 7 (keeping row 3, the first of the run) because we found five consecutive 3's in the y column, but keep rows 1, 2 because we only find two consecutive 2's there. Similarly, we keep rows 8, 9, 10, 11 because we only find four consecutive 4's in the y column.
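For anyone who wants to reproduce this, the example frame can be built like so (assuming pandas is imported as pd):

import pandas as pd

df = pd.DataFrame({
    'x': range(1, 13),
    'y': [2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 2],
})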


2 Answers


Let's try cumsum on the differences to label the consecutive blocks, then groupby().transform('size') to get the size of each block:

thresh = 5

# label consecutive runs: the id increments whenever y changes
s = df['y'].diff().ne(0).cumsum()

# rows whose run is shorter than the threshold
small_size = s.groupby(s).transform('size') < thresh
# first row of every run, kept even when the run is long
first_rows = ~s.duplicated()

df[small_size | first_rows]

Output:

     x  y
0    1  2
1    2  2
2    3  3
7    8  4
8    9  4
9   10  4
10  11  4
11  12  2
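To see why this works: for the sample data, the run labels in s increment every time y changes, so each consecutive block shares one label (shown here for illustration):

s = df['y'].diff().ne(0).cumsum()
print(s.tolist())  # [1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4]

small_size then flags every row whose block has fewer than thresh members, and first_rows keeps the first row of each block even when the block is long.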
  • Very close @Quang Hoang! But I'd like to keep the row at index 2 of the original df - keep the first instance and remove the following consecutive duplicates if we find that five or more consecutive duplicates appear. Thank you! – azam Jul 30 '20 at 22:38
  • @azam I see, updated the answer to keep the first rows. – Quang Hoang Jul 30 '20 at 22:45
  • Worked for test dataset and when applied to full dataset! – azam Jul 31 '20 at 13:27
  • to add to this, wondering your thoughts: instead of dropping the consecutive duplicates, do you have any thoughts on setting them to NaN to keep the original shape of the df? – azam Aug 03 '20 at 12:36
  • `df.loc[~(small_size | first_rows)] = np.nan`. – Quang Hoang Aug 03 '20 at 13:22
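Spelled out, that NaN variant from the comments would look like this (a sketch; note that assigning through df.loc with just the row mask blanks every column, so restrict the assignment to 'y' if x should survive):

import numpy as np

mask = ~(small_size | first_rows)
df.loc[mask, 'y'] = np.nan  # keep the frame's shape, blank only y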

Not straightforward; I would go with @Quang Hoang's approach.

Create a column which gives the number of times a value is duplicated. In this case I used np.where() and df.duplicated() and assigned anything with a count > 4 to NaN:

import numpy as np

# the transform returns each y-group's size (count() counts all rows,
# duplicated or not), so values appearing more than 4 times become NaN
df['g'] = np.where(df.groupby('y').transform(lambda x: x.duplicated(keep='last').count()) > 4, np.nan, 1)

I then create two dataframes: one where I drop all the NaNs, and one with only NaNs. In the one with NaNs, I drop all rows apart from the last index using .last_valid_index(). I then append them and sort by index using .sort_index(). Finally, I use .iloc[:, :2] to slice off the new column I created above.

df.dropna().append(df.loc[df[df.g.isna()].last_valid_index()]).sort_index().iloc[:,:2]

     x    y
0    1.0  2.0
1    2.0  2.0
6    7.0  3.0
7    8.0  4.0
8    9.0  4.0
9   10.0  4.0
10  11.0  4.0
11  12.0  2.0
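Note that DataFrame.append has since been removed from pandas (in 2.0); an equivalent using pd.concat, assuming the same df and helper column g, would be:

# select the last row of the NaN block as a one-row frame, then concatenate
last_nan = df.loc[[df[df.g.isna()].last_valid_index()]]
pd.concat([df.dropna(), last_nan]).sort_index().iloc[:, :2]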