I have a pandas dataframe with multiple columns and rows. I wish to find the consecutive duplicate values in a particular column and delete the entire row of the first occurrence of that duplicate value.

I found a possible solution elsewhere, but it works only with a pandas Series: a.loc[a.shift() != a]

To visualize, my dataframe looks something like this:

Index  column0  column1  column2  column3
row0   0.5      25       26       27
row1   0.5      30       31       32
row2   1.0      35       36       37
row3   1.5      40       41       42

Index  column0  column1  column2  column3
row1   0.5      30       31       32
row2   1.0      35       36       37
row3   1.5      40       41       42

The second table is the expected result, with row0 deleted.

P.S. In my actual data the duplicate does not occur at the beginning; it occurs at random positions in column0.

2 Answers

df.loc[df.iloc[:, 0].shift(-1) != df.iloc[:, 0]]

This is the answer! Thank you Quang Hoang!
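
As a quick check, here is a minimal sketch that applies the one-liner above to the sample data from the question. The column is selected by name rather than by position, and the variable names `keep` and `result` are only illustrative.

```python
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame(
    {
        "column0": [0.5, 0.5, 1.0, 1.5],
        "column1": [25, 30, 35, 40],
        "column2": [26, 31, 36, 41],
        "column3": [27, 32, 37, 42],
    },
    index=["row0", "row1", "row2", "row3"],
)

# shift(-1) lines each row up with the value from the next row, so the mask
# is False exactly where a row is immediately followed by an equal value.
keep = df["column0"].shift(-1) != df["column0"]

# Keep only rows that are NOT followed by the same column0 value.
result = df.loc[keep]
print(result)
```

Two things to keep in mind: in a run of three or more equal values only the last row of the run survives, and the final row is always kept because the shifted value there is NaN, which never compares equal.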

Here is a step-by-step solution.

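import pandas as pd
import numpy as np

# Random sample data: 10 rows, 4 integer columns with values from 0 to 6
df = pd.DataFrame(np.random.randint(0, 7, size=(10, 4)), columns=list('ABCD'))

# Count how many times each value occurs in column A
number_of_occurrence_on_first_column = df.groupby('A')['A'].count()

# Values of A that occur more than once
has_duplicates_items = number_of_occurrence_on_first_column[number_of_occurrence_on_first_column > 1].index

# All rows whose A value is one of the repeated values
all_duplicate_items = df[df.A.isin(has_duplicates_items)]

# Index labels of the first occurrence of each repeated value
need_to_delete = pd.DataFrame(all_duplicate_items['A']).drop_duplicates().index
# Drop those first occurrences (the repeats do not have to be consecutive)
df = df.drop(need_to_delete)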
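
Because the sample above is random, the effect can be hard to see. The following sketch uses a small hypothetical frame (`demo` and all variable names are illustrative, not part of the answer) to compare dropping the first occurrence of any repeated value, which is what the steps above do, with dropping only rows that are immediately followed by the same value.

```python
import pandas as pd

# Hypothetical data: 1 repeats non-consecutively, 2 repeats consecutively.
demo = pd.DataFrame({"A": [1, 2, 2, 1, 3], "B": [10, 20, 30, 40, 50]})

# Equivalent of the groupby steps: first occurrence of every repeated value.
is_repeated = demo["A"].duplicated(keep=False)   # value occurs more than once
is_first = ~demo["A"].duplicated(keep="first")   # first time the value appears
dropped_any = demo[~(is_repeated & is_first)]

# Consecutive-only variant: drop a row when the next row has the same value.
followed_by_same = demo["A"].shift(-1) == demo["A"]
dropped_consecutive = demo[~followed_by_same]

print(dropped_any)
print(dropped_consecutive)
```

On this data the first approach removes rows 0 and 1, while the consecutive-only variant removes just row 1, so which one fits depends on whether non-adjacent repeats should count as duplicates.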
Ario