
I am using Python with the pandas library. I have a DataFrame `df`, and I need to write a function to filter out duplicates, that is, to remove each row whose value in column `A` is the same as the row immediately above it.

Example:

df = pd.DataFrame({'A': {0: 1, 1: 2, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5, 7: 5, 8: 6, 9: 7, 10: 7}, 'B': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g', 7: 'h', 8: 'i', 9: 'j', 10: 'k'}})

I wrote the code below.

total_len = len(df.index)
for i in range(total_len):
        if df['A'].loc[i] == df['A'].loc[i+1]: 
            df['A'].drop(df['A'].index[i+1])
        else:
            df['A']

what am I doing wrong?

lomye
  • Does this answer your question? [Drop all duplicate rows across multiple columns in Python Pandas](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-across-multiple-columns-in-python-pandas) – Trenton McKinney Aug 30 '20 at 23:03
  • Does this answer the question? [Pandas: Drop duplicates based on row value](https://stackoverflow.com/questions/58490071/pandas-drop-duplicates-based-on-row-value) – Trenton McKinney Aug 30 '20 at 23:04
  • unlikely, he is only looking for a comparison with the previous row – Akshay Sehgal Aug 30 '20 at 23:10
  • @OP - updated my answer to fix your method for comparing dups in previous rows only. do check and let me know if that works for you. – Akshay Sehgal Aug 30 '20 at 23:56
  • `df.drop_duplicates('A')` – Scott Boston Aug 30 '20 at 23:58
  • @ScottBoston - the question is to compare and remove duplicates only between the previous row, not throughout the dataframe – Akshay Sehgal Aug 30 '20 at 23:59
  • Or.. `df[df['A'] != df['A'].shift()]` for just the previous row. – Scott Boston Aug 31 '20 at 00:02
  • @TrentonMcKinney, thanks for the suggestion. I wanted to remove a duplicate only if it appears in the previous row. `drop_duplicates` drops all of the duplicates in a column, but it is useful to know that you can choose to keep the first or last duplicate, or none of them. Thank you! – lomye Aug 31 '20 at 15:48
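To make the distinction in the comments concrete, here is a small sketch contrasting `drop_duplicates` with the `shift` comparison. It uses an illustrative frame of my own (not the one from the question), chosen so that a value reappears non-consecutively and the two approaches actually differ:

```python
import pandas as pd

# illustrative frame: the value 1 reappears non-consecutively
df2 = pd.DataFrame({'A': [1, 2, 2, 1]})

# drop_duplicates removes every repeat of a value, wherever it occurs
print(df2.drop_duplicates('A')['A'].tolist())           # [1, 2]

# comparing against shift() removes only rows that repeat the
# value of the row immediately above
print(df2[df2['A'] != df2['A'].shift()]['A'].tolist())  # [1, 2, 1]
```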

2 Answers


You can do it without the loop:

df = df[ # filter df with a boolean array
    df.A.ne(df.A.shift()) # find out if elements are different from the row above
]
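Applied to the example `df` from the question, this keeps the first row of each consecutive run (the first row always survives, since `NaN` from `shift()` never equals a real value):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7],
                   'B': list('abcdefghijk')})

# keep rows whose A differs from the A of the row above
out = df[df.A.ne(df.A.shift())]
print(out['A'].tolist())  # [1, 2, 3, 4, 5, 6, 7]
print(out['B'].tolist())  # ['a', 'b', 'd', 'e', 'f', 'i', 'j']
```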
RichieV
  • Nice approach. Might I add, ```df[df.ne(df.shift())['A']]``` also works. – Akshay Sehgal Aug 31 '20 at 00:08
  • Yes, that works too, but you compare all columns needlessly; slower with a wide df – RichieV Aug 31 '20 at 00:09
  • I also recommend not using the dot notation for column referencing. – Scott Boston Aug 31 '20 at 00:10
  • What scott is recommending is `df['A']` instead of `df.A` I guess. – Akshay Sehgal Aug 31 '20 at 00:12
  • https://stackoverflow.com/questions/44798031/when-should-i-use-dt-column-vs-dtcolumn-pandas please read. – Akshay Sehgal Aug 31 '20 at 00:13
  • Using the dot notation has its limits, such as with column names that contain spaces. You can't use `df.A ColumnName.mean()`, but you can do `df['A ColumnName'].mean()` – Scott Boston Aug 31 '20 at 00:13
  • @ScottBoston Agreed, I thought you were saying something would fail in this code... it is worth knowing that, but I didn't think about it since OP uses standard indexing pretty well – RichieV Aug 31 '20 at 00:18
  • @AkshaySehgal that is just a suggestion, but if you keep your names legal this is just faster to type, a matter of style I guess – RichieV Aug 31 '20 at 00:20
  • While not that big a deal, I don't think it's a matter of style; it's about best practices, and pandas itself recommends using `[]`. But then again, like I said before, it's not that big a deal – Akshay Sehgal Aug 31 '20 at 00:24
  • @AkshaySehgal it is not a big deal in fact, as I said it is just faster sometimes and still a valid method, it is good to know the options, benefits, and limitations... I was expecting something more on the lines of _"even though it is possible, it hinders performance"_ – RichieV Aug 31 '20 at 00:37
  • Sure, but this is not about options, benefits, or limitations; it's about the best practices recommended by pandas. Also, I am not the one who pointed it out in the first place, I merely linked it for your reference. – Akshay Sehgal Aug 31 '20 at 00:45

The issue with your code is that this df has 11 rows, indexed 0-10, but `range(total_len)` runs `i` all the way up to 10. When `i = 10`, `df['A'].loc[i+1]` looks for row 11 to compare against, which doesn't exist. Hence the `KeyError: 11`

total_len = len(df.index)
for i in range(total_len):
        if df['A'].loc[i] == df['A'].loc[i+1]: 
            df['A'].drop(df['A'].index[i+1])
        else:
            df['A']
#ERROR            
KeyError: 11   
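For what it's worth, a minimal sketch that keeps the loop but fixes the off-by-one: stop the range one row early, and collect the labels to drop first, since `drop` returns a new object rather than modifying `df` in place (which is the other silent problem in the original code):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7],
                   'B': list('abcdefghijk')})

to_drop = []
total_len = len(df.index)
for i in range(total_len - 1):           # stop before the last row
    if df['A'].loc[i] == df['A'].loc[i + 1]:
        to_drop.append(i + 1)            # remember the duplicate row's label

df = df.drop(to_drop)                    # drop returns a new DataFrame
print(df['A'].tolist())  # [1, 2, 3, 4, 5, 6, 7]
```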

Instead, a better way to solve this would be to iterate starting from the second row, comparing each row with the one before it, to build a list of `True`/`False` flags. Then you can use that list to filter the df -

dup = [True]

total_len = len(df.index)
for i in range(1, total_len):
    if df.iloc[i]['A'] == df.iloc[i-1]['A']:
        dup.append(False)
    else:
        dup.append(True)
        
print(df[dup])
   A  B
0  1  a
1  2  b
3  3  d
4  4  e
5  5  f
8  6  i
9  7  j
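Note that the filtered frame keeps the original row labels (0, 1, 3, 4, ...). If you want a fresh 0-based index afterwards, as the asker mentions in the comments, you can chain `reset_index`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7],
                   'B': list('abcdefghijk')})

# same flag list as above: keep a row when its A differs from the previous A
dup = [True] + [df.iloc[i]['A'] != df.iloc[i - 1]['A'] for i in range(1, len(df))]

out = df[dup].reset_index(drop=True)   # drop=True discards the old labels
print(out.index.tolist())  # [0, 1, 2, 3, 4, 5, 6]
```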
Akshay Sehgal
  • Thank you for explaining what was not working in my code and providing an alternative. I tested the code and added the `reset_index()`. Thank you – lomye Aug 31 '20 at 15:44
  • @lomye - glad to help. if this answer has helped you solve the question, do mark it as the correct one! – Akshay Sehgal Aug 31 '20 at 16:37