How to find duplicates in pandas dataframe

Question

Suppose I have the following series in pandas:
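The sample data itself is missing from this copy of the question. A reconstruction, read off the `a` column printed in the accepted answer (the column name `a` is taken from there too), could look like this:

```python
import pandas as pd

# Values reconstructed from the output shown in the accepted answer:
# three 0.0, five 0.3, three 1.0, two 0.2, three 0.3 (default index 0..15).
df = pd.DataFrame({'a': [0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3})
print(df['a'])
```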

I need to identify each sequence of consecutive duplicates, that is, its first and last index. Using the above example, I need to identify the first sequence of 0.3 (from index 3 to 7) independently of the last sequence of 0.3 (from index 13 to 15).

Using Series.duplicated is insufficient because:

* Using keep='first' marks the first instance of each duplicated value as False, but leaves index 13 as True because it is not the first appearance of 0.3 in the series.

* The same goes for keep='last'.

* keep=False just marks all of the entries as True.
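The three cases above can be checked with a short sketch. It assumes the sample values as they appear in the accepted answer's output; the variable names are only illustrative.

```python
import pandas as pd

# Sample values taken from the output printed in the accepted answer.
s = pd.Series([0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3)

first = s.duplicated(keep='first')  # False only at the first occurrence of each value
last = s.duplicated(keep='last')    # False only at the last occurrence of each value
both = s.duplicated(keep=False)     # True for every value that occurs more than once

# Index 13 starts the second run of 0.3 but is still flagged as a duplicate.
print(first[13])
# Index 7 ends the first run of 0.3 but is still flagged as a duplicate.
print(last[7])
# Every value in this series repeats, so everything is flagged.
print(both.all())
```

Series.duplicated looks at the series as a whole, so it has no notion of where one consecutive run ends and the next begins.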

Thank you!

Seems like an easy problem, but hard to visualize without data. Show some sample data — rafaelc, Jun 12 '18 at 20:44
Counting values in a column is already covered in quite a few places on this site and elsewhere. Where are you stuck with those solutions? Even without those, where is your basic looping code for recognizing consecutive values (another well-covered application)? — Prune, Jun 12 '18 at 20:48
Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. — MaxU - stand with Ukraine, Jun 12 '18 at 21:33
Thank you and apologies. I tried to post this question quickly and didn't realize it was so unclear. I have edited it and added a simple example to demonstrate the problem. Thank you in advance! — sa_zy, Jun 13 '18 at 05:32

jezrael · Accepted Answer · 2018-06-13T05:48:54.467

I believe you need a trick: compare the column with its shifted values for inequality using ne, take the cumsum to label each run, and then use drop_duplicates to pick the first and last row of each run:

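# the label increases by 1 whenever a value differs from the previous row
s = df['a'].ne(df['a'].shift()).cumsum()
a = s.drop_duplicates().index             # first index of each run
b = s.drop_duplicates(keep='last').index  # last index of each run

# use a new name so the original df is not overwritten
out = pd.DataFrame({'first':a, 'last':b})
print (out)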
   first  last
0      0     2
1      3     7
2      8    10
3     11    12
4     13    15
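To see why the shift/ne/cumsum combination labels runs, it can help to look at the intermediate steps. The sketch below rebuilds the sample data from the printed output; the names `shifted`, `changed` and `run_id` are illustrative and not part of the answer.

```python
import pandas as pd

# Sample values taken from the output printed in this answer.
df = pd.DataFrame({'a': [0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3})

shifted = df['a'].shift()      # previous row's value; NaN for the first row
changed = df['a'].ne(shifted)  # True where a value differs from the previous row
run_id = changed.cumsum()      # running count of changes, i.e. one label per run

steps = pd.DataFrame({'a': df['a'], 'shifted': shifted,
                      'changed': changed, 'run_id': run_id})
print(steps)
```

One caveat: NaN never compares equal to NaN, so consecutive missing values would each start a new run with this approach.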

If you also want the duplicated value in a new column, change the solution slightly to use duplicated:

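s = df['a'].ne(df['a'].shift()).cumsum()
a = df.loc[~s.duplicated(), 'a']    # value at the first row of each run
b = s.drop_duplicates(keep='last')  # last row of each run

# use a new name so the original df is not overwritten
out = pd.DataFrame({'first':a.index, 'last':b.index, 'val':a})
print (out)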
    first  last  val
0       0     2  0.0
3       3     7  0.3
8       8    10  1.0
11     11    12  0.2
13     13    15  0.3

If you need the run labels as a new column:

df['count'] = df['a'].ne(df['a'].shift()).cumsum()
print (df)
      a  count
0   0.0      1
1   0.0      1
2   0.0      1
3   0.3      2
4   0.3      2
5   0.3      2
6   0.3      2
7   0.3      2
8   1.0      3
9   1.0      3
10  1.0      3
11  0.2      4
12  0.2      4
13  0.3      5
14  0.3      5
15  0.3      5
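Because the run label is constant within each run, it could also serve as a group key. The following is an alternative sketch, not the method used above: it assumes a default integer index and pandas 0.25 or later for named aggregation, and the names `run` and `runs` are made up for illustration.

```python
import pandas as pd

# Sample values taken from the output printed in this answer.
df = pd.DataFrame({'a': [0.0] * 3 + [0.3] * 5 + [1.0] * 3 + [0.2] * 2 + [0.3] * 3})

# Same run label as the 'count' column, kept as a separate Series.
run = df['a'].ne(df['a'].shift()).cumsum().rename('run')

# reset_index exposes the row labels as a column named 'index',
# so the first and last label of each run can be aggregated directly.
runs = (
    df.reset_index()
      .groupby(run)
      .agg(first=('index', 'first'), last=('index', 'last'), val=('a', 'first'))
)
print(runs)

# The two runs of 0.3 stay separate rows.
print(runs[runs['val'] == 0.3])
```

This gives the same first/last/val information in one pass, indexed by run label instead of by the first row of each run.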

This is perfect. Exactly what I needed. Thank you so much for your help!! — sa_zy, Jun 13 '18 at 05:46
