
I need to find all the duplicates in one column of a CSV file, and then export these to a different CSV file. I've tried answers from this: How do I get a list of all the duplicate items using pandas in python? but am not getting the correct result. Example of my CSV file:

    filename,ID,status
    71.wav,107e,accepted
    85.wav,9a99,accepted
    85.wav,d27a,accepted
    86.wav,ea4f,accepted
    86.wav,9f9b,accepted
    75.wav,b734,accepted
    75.wav,3dfb,accepted

I would like an output of:

    85.wav,9a99,accepted
    86.wav,ea4f,accepted
    75.wav,b734,accepted

I tried:

    ids = df["filename"]
    dups = df[ids.isin(ids[ids.duplicated()])].sort_values("filename")
    print(dups)

The output of this gave unique values as well as duplicate values.

My expected output would be a csv file with the first duplicate listed as shown above (I edited the question to clarify).

Mehul Gupta
topplethepat
    `df[df.duplicated('filename')]`? For keeping all dupe values: `df[df.duplicated('filename',keep=False)]`? – anky May 05 '20 at 16:27
  • But how to extract the duplicates only so I can work with them? – topplethepat May 05 '20 at 16:32
  • thanks i've edited the question now so it's clearer and includes an expected output. – topplethepat May 05 '20 at 16:47
  • 1
    did you try my comments above?addimg `last` to the same resolves it `df[df.duplicated('filename',keep='last')].to_csv(.....)` ? – anky May 05 '20 at 16:50
  • yes i tried the one above and the output csv had the same number of rows as the original. – topplethepat May 05 '20 at 16:53
  • 1
    well `df[df.duplicated('filename',keep='last')]` gives me exactly what you have shown as expected output , if not then probably you have unwanted spaces in your original data and you might want to `strip()` them off – anky May 05 '20 at 16:54
  • 1
    totally right, I did have issues in original data, removed those, and this works great. thanks! – topplethepat May 05 '20 at 17:13

2 Answers


Are you looking for something like this?

import pandas as pd

df = pd.DataFrame({"id":[1,1,1,1,2,2,3,4,5],
                   "name":["Georgia","Georgia","Georgia","Georgia","Camila","Camila","Diego","Luis","Jose"]})


duplicates = df[df.duplicated(["id"])]

Returns

   id     name
1   1  Georgia
2   1  Georgia
3   1  Georgia
5   2   Camila
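
As a usage note, `duplicated()` defaults to `keep='first'`, which is why the first row of each group is missing from the output above; `keep=False` would mark every member of a duplicate group instead. A sketch of the two variants on the same data:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 3, 4, 5],
                   "name": ["Georgia", "Georgia", "Georgia", "Georgia",
                            "Camila", "Camila", "Diego", "Luis", "Jose"]})

# Default keep='first': marks all but the first occurrence (4 rows here)
later_dups = df[df.duplicated(["id"])]

# keep=False: marks every member of a duplicate group (6 rows here)
all_dups = df[df.duplicated(["id"], keep=False)]
```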
blink_182

This method should definitely help.

import pandas as pd

data = {'Test':[1,2,3,4,5,6,2,4,2,5,6,3,2,7,8,9]}
df = pd.DataFrame(data)

dups = df[df.duplicated()]

returns

    Test
6      2
7      4
8      2
9      5
10     6
11     3
12     2
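
As an optional inspection step (not part of the answer itself), `value_counts()` shows which values repeat and how often, which can help spot data issues like the stray whitespace discussed in the comments:

```python
import pandas as pd

df = pd.DataFrame({'Test': [1, 2, 3, 4, 5, 6, 2, 4, 2, 5, 6, 3, 2, 7, 8, 9]})

# Count occurrences of each value; entries with a count above 1 are duplicates
counts = df["Test"].value_counts()
repeated = counts[counts > 1]
```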

Nekz