
I need to find all the duplicates in one column of a CSV file, and then export these to a different CSV file. I've tried answers from this: How do I get a list of all the duplicate items using pandas in python? but am not getting the correct result. Example of my CSV file:

    filename,ID,status
    71.wav,107e,accepted
    85.wav,9a99,accepted
    85.wav,d27a,accepted
    86.wav,ea4f,accepted
    86.wav,9f9b,accepted
    75.wav,b734,accepted
    75.wav,3dfb,accepted

I would like an output of:

    85.wav,9a99,accepted
    86.wav,ea4f,accepted
    75.wav,b734,accepted

I tried:

    ids = df["filename"]
    dups = df[ids.isin(ids[ids.duplicated()])].sort_values("filename")
    print(dups)

The output of this gave unique values as well as duplicate values.

My expected output would be a csv file with the first duplicate listed as shown above (I edited the question to clarify).

Mehul Gupta
topplethepat
    `df[df.duplicated('filename')]`? For keeping all dupe values: `df[df.duplicated('filename',keep=False)]`? – anky May 05 '20 at 16:27
  • But how to extract the duplicates only so I can work with them? – topplethepat May 05 '20 at 16:32
  • thanks i've edited the question now so it's clearer and includes an expected output. – topplethepat May 05 '20 at 16:47
  • 1
    did you try my comments above?addimg `last` to the same resolves it `df[df.duplicated('filename',keep='last')].to_csv(.....)` ? – anky May 05 '20 at 16:50
  • yes i tried the one above and the output csv had the same number of rows as the original. – topplethepat May 05 '20 at 16:53
  • 1
    well `df[df.duplicated('filename',keep='last')]` gives me exactly what you have shown as expected output , if not then probably you have unwanted spaces in your original data and you might want to `strip()` them off – anky May 05 '20 at 16:54
  • 1
    totally right, I did have issues in original data, removed those, and this works great. thanks! – topplethepat May 05 '20 at 17:13

2 Answers


Are you looking for something like this?

import pandas as pd

df = pd.DataFrame({"id":[1,1,1,1,2,2,3,4,5],
                   "name":["Georgia","Georgia","Georgia","Georgia","Camila","Camila","Diego","Luis","Jose"]})


duplicates = df[df.duplicated(["id"])]

Returns

   id     name
1   1  Georgia
2   1  Georgia
3   1  Georgia
5   2   Camila
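
As a usage note, `duplicated()` defaults to `keep='first'`, which is why the first row of each group is missing from the output above; `keep=False` would mark every member of a duplicate group instead. A sketch of the two variants on the same data:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 3, 4, 5],
                   "name": ["Georgia", "Georgia", "Georgia", "Georgia",
                            "Camila", "Camila", "Diego", "Luis", "Jose"]})

# Default keep='first': marks all but the first occurrence (4 rows here)
later_dups = df[df.duplicated(["id"])]

# keep=False: marks every member of a duplicate group (6 rows here)
all_dups = df[df.duplicated(["id"], keep=False)]
```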
blink_182

This method should definitely help.

import pandas as pd

data = {'Test':[1,2,3,4,5,6,2,4,2,5,6,3,2,7,8,9]}
df = pd.DataFrame(data)

dups = df[df.duplicated()]

returns

    Test
6      2
7      4
8      2
9      5
10     6
11     3
12     2
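
As an optional inspection step (not part of the answer itself), `value_counts()` shows which values repeat and how often, which can help spot data issues like the stray whitespace discussed in the comments:

```python
import pandas as pd

df = pd.DataFrame({'Test': [1, 2, 3, 4, 5, 6, 2, 4, 2, 5, 6, 3, 2, 7, 8, 9]})

# Count occurrences of each value; entries with a count above 1 are duplicates
counts = df["Test"].value_counts()
repeated = counts[counts > 1]
```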

Nekz