Identifying duplicates, then trying to use a for loop to drop them - ans: instead use drop_duplicates()

Question

First, I'm a newbie, so if there's a simpler way to do this, I'm all ears.

I have some relatively simple code to find duplicates, then remove them. I'm not sure what I'm doing wrong. Basically I create a series from .duplicated. Then I'm running a for loop against the data frame to remove the duplicates. I know I have dups (193 of them), but nothing is getting removed. I start with 1893 rows and still have 1893 at the end. Here's what I have so far.

#drop the rows, starting w creating a boolean of where dups are
ms_clnd_bool = ms_clnd_study.duplicated()
print(ms_clnd_bool)  #look at what I have
x = 0
for row in ms_clnd_bool:   #for loop through the duplicates series
    if ms_clnd_bool[x] == True:
        ms_clnd_study.drop(ms_clnd_study.index[x])
    x += 1

ms_clnd_study

Thanks for the help!

check this (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) — Ajay Verma, Oct 03 '20 at 14:24
Please don't post images of code, data, or Tracebacks. Copy and paste it as text then format it as code (select it and type `ctrl-k`) ... [Discourage screenshots of code and/or errors](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors) — wwii, Oct 03 '20 at 14:25
IMHO, yes please. Your [mre] should also always include a minimal example of the data you are operating on. [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — wwii, Oct 03 '20 at 14:28
i do have code in the proper format above it . . . i was thinking being able to see the results would be helpful, but i can def see where screenshots of it isn't helpful. ty for the guidance! — crobaseball, Oct 03 '20 at 17:41
so the problem that i have is that i can't use .drop_duplicates because what i'm trying to do is actually drop the dups from a different dataframe. i.e. the duplicate values are in one df but the actual rows to drop are in a different one. does this make sense? — crobaseball, Oct 03 '20 at 17:45

score 1 · Answer 1 · answered Oct 03 '20 at 14:32

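Pandas has a drop_duplicates method: [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html). It does exactly what you are aiming for. You can choose which occurrence to keep with the keep parameter: the first, the last, or none (keep=False).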
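To make the keep options concrete, here is a minimal sketch. The frame and its column names (mouse_id, timepoint) are made up for illustration and only stand in for ms_clnd_study; your real data will look different.

```python
import pandas as pd

# Hypothetical stand-in for ms_clnd_study; the column names are assumptions.
df = pd.DataFrame({
    "mouse_id": ["a1", "a1", "b2", "c3", "c3"],
    "timepoint": [0, 0, 5, 10, 10],
})

# Default: keep the first occurrence of each duplicated row.
keep_first = df.drop_duplicates()

# Keep the last occurrence instead.
keep_last = df.drop_duplicates(keep="last")

# Drop every row that has a duplicate, keeping none of them.
keep_none = df.drop_duplicates(keep=False)

print(len(df), len(keep_first), len(keep_last), len(keep_none))
```

Each call returns a new DataFrame and leaves df untouched, so assign the result to a name if you want to keep it.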

As a general tip: it's not common to use loops to scan through your whole dataframe in pandas. For something as common as dropping duplicates, you're better off looking for an existing solution first rather than writing one on your own.

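As for your code, drop returns a new DataFrame by default, so the result is thrown away unless you assign it back or specify inplace=True. Note that removing items while looping can be dangerous: just think about what happens if I have a list [1, 2, 3], remove the 2, and keep looping by position. I'll either skip an element or run past the end of the list. It may not fail the same way in pandas, but it's a source of trouble.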
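A small sketch of both points, again on made-up data (the frames study and other and their columns are hypothetical). It shows that a bare drop call leaves the original frame unchanged, and that the boolean Series from duplicated() can filter rows without a loop. The last part assumes the second frame shares the same index as the first, which may or may not match the situation described in the comments.

```python
import pandas as pd

# Hypothetical stand-in for ms_clnd_study; the column names are assumptions.
study = pd.DataFrame({
    "mouse_id": ["a1", "a1", "b2", "c3", "c3"],
    "timepoint": [0, 0, 5, 10, 10],
})

# True for every row that repeats an earlier row.
dup_mask = study.duplicated()

# drop() returns a new frame; the result is discarded here,
# so study itself still has all of its rows.
study.drop(study.index[dup_mask.to_numpy()])
rows_after_discarded_drop = len(study)

# No loop needed: keep only the rows that are not flagged as duplicates.
deduped = study[~dup_mask]

# Assumption: a second frame with the same index as study.
# The same mask then removes the matching rows there too.
other = pd.DataFrame(
    {"weight": [20.1, 20.1, 22.4, 19.8, 19.8]},
    index=study.index,
)
other_deduped = other[~dup_mask]

print(rows_after_discarded_drop, len(deduped), len(other_deduped))
```

If the two frames do not share an index, the mask will not line up and you would need a key column to match rows instead.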
