How can I drop duplicate data in a single column, group-wise in pandas?

Question

If the df is grouped by A, B, and C, and looks something like this:

    A    B      C    D
    1    53704  hf   51602
                     51602   
                     53802
                ss   53802
                     53802
    2    12811  hf   54205
                hx   50503

I have tried the following, which is similar to something from another post:

    df.groupby([df['A'], df['B'], df['C']]).drop_duplicates(cols='D')

This is obviously incorrect, as it produces an empty dataframe. I've also tried another variation with drop_duplicates that simply deletes all duplicates from 'D', no matter which group they are in. The output I'm looking for is:

    A    B      C   D
    1    53704  hf  51602
                    53802
                ss  53802
    2    12811  hf  54205
                hx  50503

So that duplicates are only dropped when they are grouped into the same A/B/C combination.

oops. typo. fixed it and added a second duplicate to make things more obvious. — M.A.Kline, Oct 24 '13 at 05:09
What output do you get? It works for me, either `df.groupby(('A','B','C')).drop_duplicates('D')` or `df.drop_duplicates().groupby(('A','B','C'))` — Roman Pekar, Oct 24 '13 at 05:35
This is a bit confusing, as you are accessing these as columns (df['A']), but they are displayed like the index (is this just how you configured your repr?)... If they're not columns, make them columns first; that will be easiest. — Andy Hayden, Oct 24 '13 at 06:53
So yes, they are actually indexes, as a result of a previous groupby manipulation. I'm really showing my lack of skills as yet, here, but how do I convert an index into a column? I've looked at the set_index and reindex documentation but I'm not making sense of it. — M.A.Kline, Oct 24 '13 at 19:02
unstack doesn't work, because there are duplicate entries... — M.A.Kline, Oct 24 '13 at 19:09
To further the comment stream to myself, what I was looking for is reset_index()! — M.A.Kline, Oct 24 '13 at 19:42

score 2 · Accepted Answer · answered Oct 24 '13 at 06:26

Assuming these are just columns, you can use drop_duplicates directly:

In [11]: df.drop_duplicates(cols=list('ABCD'))
Out[11]: 
   A      B   C      D
0  1  53704  hf  51602
2  1  53704  hf  53802
3  1  53704  ss  53802
5  2  12811  hf  54205
6  2  12811  hx  50503

If you're interested in duplicates across all columns, you don't need to specify them:

In [12]: df.drop_duplicates()
Out[12]: 
   A      B   C      D
0  1  53704  hf  51602
2  1  53704  hf  53802
3  1  53704  ss  53802
5  2  12811  hf  54205
6  2  12811  hx  50503

accepting this answer because Andy pointed out I can only do this with columns, not indexes. — M.A.Kline, Oct 24 '13 at 19:43
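
For the case raised in the comments, where A, B, and C are index levels rather than columns, a minimal sketch might look like the following. The data is rebuilt by hand from the question's table, and the name `deduped` is only illustrative. It uses `subset=`, the keyword in current pandas, in place of the older `cols=`.

```python
import pandas as pd

# Rebuild the question's data with A, B, C as index levels (assumed layout).
df = pd.DataFrame(
    {
        "A": [1, 1, 1, 1, 1, 2, 2],
        "B": [53704, 53704, 53704, 53704, 53704, 12811, 12811],
        "C": ["hf", "hf", "hf", "ss", "ss", "hf", "hx"],
        "D": [51602, 51602, 53802, 53802, 53802, 54205, 50503],
    }
).set_index(["A", "B", "C"])

# Turn the index levels into columns, drop rows that repeat the same
# (A, B, C, D) combination, then restore the original index.
deduped = (
    df.reset_index()
    .drop_duplicates(subset=["A", "B", "C", "D"])
    .set_index(["A", "B", "C"])
)
print(deduped)
```

The value 53802 survives once under `hf` and once under `ss`, since those are different A/B/C groups.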

score 1 · Answer 2 · answered Sep 23 '20 at 17:07


Updating the syntax of the accepted answer. In pandas 1.1.1+, the following:

df.drop_duplicates(cols=list('ABCD'))

Should be changed to:

df.drop_duplicates(subset=list('ABCD'))
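
As a self-contained check of the `subset=` form, here is a small sketch using data typed in from the question, with A, B, and C assumed to be ordinary columns. The `mask` variable is illustrative; it shows that `duplicated` with the same `subset` selects exactly the rows that `drop_duplicates` removes.

```python
import pandas as pd

# Data copied from the question, with A, B, C as regular columns (assumption).
df = pd.DataFrame(
    {
        "A": [1, 1, 1, 1, 1, 2, 2],
        "B": [53704, 53704, 53704, 53704, 53704, 12811, 12811],
        "C": ["hf", "hf", "hf", "ss", "ss", "hf", "hx"],
        "D": [51602, 51602, 53802, 53802, 53802, 54205, 50503],
    }
)

# Keep the first row of each distinct (A, B, C, D) combination.
deduped = df.drop_duplicates(subset=list("ABCD"))

# Equivalent boolean-mask form: True marks a repeat of an earlier row.
mask = df.duplicated(subset=list("ABCD"))
same_result = deduped.equals(df[~mask])
print(deduped)
```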

SummerEla
