
I am using Python with the pandas library. I have a DataFrame `df`, and I need to write a function to filter out duplicates, that is, to remove each row whose value in column `A` is the same as the row immediately above it.

Example:

df = pd.DataFrame({'A': {0: 1, 1: 2, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5, 7: 5, 8: 6, 9: 7, 10: 7}, 'B': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g', 7: 'h', 8: 'i', 9: 'j', 10: 'k'}})

I wrote the code below.

total_len = len(df.index)
for i in range(total_len):
        if df['A'].loc[i] == df['A'].loc[i+1]: 
            df['A'].drop(df['A'].index[i+1])
        else:
            df['A']

what am I doing wrong?

lomye
  • Does this answer your question? [Drop all duplicate rows across multiple columns in Python Pandas](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-across-multiple-columns-in-python-pandas) – Trenton McKinney Aug 30 '20 at 23:03
  • Does this answer the question? [Pandas: Drop duplicates based on row value](https://stackoverflow.com/questions/58490071/pandas-drop-duplicates-based-on-row-value) – Trenton McKinney Aug 30 '20 at 23:04
  • unlikely, he is only looking for a comparison with the previous row – Akshay Sehgal Aug 30 '20 at 23:10
  • @OP - updated my answer to fix your method for comparing dups in previous rows only. do check and let me know if that works for you. – Akshay Sehgal Aug 30 '20 at 23:56
  • `df.drop_duplicates('A')` – Scott Boston Aug 30 '20 at 23:58
  • @ScottBoston - the question is to compare and remove duplicates only between the previous row, not throughout the dataframe – Akshay Sehgal Aug 30 '20 at 23:59
  • Or.. `df[df['A'] != df['A'].shift()]` for just the previous row. – Scott Boston Aug 31 '20 at 00:02
  • @TrentonMcKinney, thanks for the suggestion. I wanted to remove a duplicate only if it appears in the previous row. `drop_duplicates` drops all of the duplicates in a column, but it is useful to know that you can choose to keep the first or last duplicate, or none of them. Thank you! – lomye Aug 31 '20 at 15:48
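To make the distinction in the comments concrete, here is a small sketch contrasting `drop_duplicates` with the `shift` comparison. It uses an illustrative frame of my own (not the one from the question), chosen so that a value reappears non-consecutively and the two approaches actually differ:

```python
import pandas as pd

# illustrative frame: the value 1 reappears non-consecutively
df2 = pd.DataFrame({'A': [1, 2, 2, 1]})

# drop_duplicates removes every repeat of a value, wherever it occurs
print(df2.drop_duplicates('A')['A'].tolist())           # [1, 2]

# comparing against shift() removes only rows that repeat the
# value of the row immediately above
print(df2[df2['A'] != df2['A'].shift()]['A'].tolist())  # [1, 2, 1]
```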

2 Answers


You can do it without the loop:

df = df[ # filter df with a boolean array
    df.A.ne(df.A.shift()) # find out if elements are different from the row above
]
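Applied to the example `df` from the question, this keeps the first row of each consecutive run (the first row always survives, since `NaN` from `shift()` never equals a real value):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7],
                   'B': list('abcdefghijk')})

# keep rows whose A differs from the A of the row above
out = df[df.A.ne(df.A.shift())]
print(out['A'].tolist())  # [1, 2, 3, 4, 5, 6, 7]
print(out['B'].tolist())  # ['a', 'b', 'd', 'e', 'f', 'i', 'j']
```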
RichieV
  • Nice approach. Might I add, ```df[df.ne(df.shift())['A']]``` also works. – Akshay Sehgal Aug 31 '20 at 00:08
  • Yes, that works too, but you compare all columns needlessly; slower with a wide df – RichieV Aug 31 '20 at 00:09
  • I also recommend not using the dot notation for column referencing. – Scott Boston Aug 31 '20 at 00:10
  • What scott is recommending is `df['A']` instead of `df.A` I guess. – Akshay Sehgal Aug 31 '20 at 00:12
  • https://stackoverflow.com/questions/44798031/when-should-i-use-dt-column-vs-dtcolumn-pandas please read. – Akshay Sehgal Aug 31 '20 at 00:13
  • Using the dot notation has its limits, such as with column names that contain spaces. You can't use `df.A ColumnName.mean()`, but you can do `df['A ColumnName'].mean()` – Scott Boston Aug 31 '20 at 00:13
  • @ScottBoston Agreed, I thought you were saying something would fail in this code... it is worth knowing that, but I didn't think about it since OP uses standard indexing pretty well – RichieV Aug 31 '20 at 00:18
  • @AkshaySehgal that is just a suggestion, but if you keep your names legal this is just faster to type, a matter of style I guess – RichieV Aug 31 '20 at 00:20
  • While not that big a deal, I don't think it's a matter of style; it's about best practices, and pandas itself recommends using `[]`. But then again, like I said before, it's not that big a deal – Akshay Sehgal Aug 31 '20 at 00:24
  • @AkshaySehgal it is not a big deal in fact, as I said it is just faster sometimes and still a valid method, it is good to know the options, benefits, and limitations... I was expecting something more on the lines of _"even though it is possible, it hinders performance"_ – RichieV Aug 31 '20 at 00:37
  • Sure, but this is not about options, benefits, or limitations; it's about the best practices recommended by pandas. Also, I am not the one who pointed it out in the first place, I merely linked it for your reference. – Akshay Sehgal Aug 31 '20 at 00:45

The issue with your code is that this df has 11 rows, indexed 0-10, but `range(total_len)` runs `i` all the way up to 10. When `i = 10`, `df['A'].loc[i+1]` looks for row 11 to compare against, which doesn't exist. Hence the `KeyError: 11`

total_len = len(df.index)
for i in range(total_len):
        if df['A'].loc[i] == df['A'].loc[i+1]: 
            df['A'].drop(df['A'].index[i+1])
        else:
            df['A']
#ERROR            
KeyError: 11   
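For what it's worth, a minimal sketch that keeps the loop but fixes the off-by-one: stop the range one row early, and collect the labels to drop first, since `drop` returns a new object rather than modifying `df` in place (which is the other silent problem in the original code):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7],
                   'B': list('abcdefghijk')})

to_drop = []
total_len = len(df.index)
for i in range(total_len - 1):           # stop before the last row
    if df['A'].loc[i] == df['A'].loc[i + 1]:
        to_drop.append(i + 1)            # remember the duplicate row's label

df = df.drop(to_drop)                    # drop returns a new DataFrame
print(df['A'].tolist())  # [1, 2, 3, 4, 5, 6, 7]
```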

Instead, a better way to solve this would be to iterate starting from the second row, comparing each row with the one before it, to build a list of `True`/`False` flags. Then you can use that list to filter the df -

dup = [True]

total_len = len(df.index)
for i in range(1, total_len):
    if df.iloc[i]['A'] == df.iloc[i-1]['A']:
        dup.append(False)
    else:
        dup.append(True)
        
print(df[dup])
   A  B
0  1  a
1  2  b
3  3  d
4  4  e
5  5  f
8  6  i
9  7  j
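Note that the filtered frame keeps the original row labels (0, 1, 3, 4, ...). If you want a fresh 0-based index afterwards, as the asker mentions in the comments, you can chain `reset_index`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7],
                   'B': list('abcdefghijk')})

# same flag list as above: keep a row when its A differs from the previous A
dup = [True] + [df.iloc[i]['A'] != df.iloc[i - 1]['A'] for i in range(1, len(df))]

out = df[dup].reset_index(drop=True)   # drop=True discards the old labels
print(out.index.tolist())  # [0, 1, 2, 3, 4, 5, 6]
```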
Akshay Sehgal
  • Thank you for explaining what was not working in my code and providing an alternative. I tested the code and added the `reset_index()`. Thank you – lomye Aug 31 '20 at 15:44
  • @lomye - glad to help. if this answer has helped you solve the question, do mark it as the correct one! – Akshay Sehgal Aug 31 '20 at 16:37