
I have a question regarding finding duplicates in a dataframe, and removing duplicates based on a specific column. Here is what I am trying to accomplish:

Is it possible to remove duplicates but keep the first 2?

Here is an example of my current dataframe, called df; take a look at the bracketed notes I have placed below to give you an idea of which rows should stay and which should go.

Note: if 'Roll' = 1, then I want to look at the Date column, find the second row with that duplicate Date, keep those two rows, and delete any others.

    Date    Open    High     Low      Close  Roll  Dupes
1  19780106  236.00  237.50  234.50  235.50     0    NaN
2  19780113  235.50  239.00  235.00  238.25     0    NaN
3  19780120  238.00  239.00  234.50  237.00     0    NaN
4  19780127  237.00  238.50  235.50  236.00     1    NaN (KEEP)  
5  19780203  236.00  236.00  232.25  233.50     0    NaN (KEEP)
6  19780127  237.00  238.50  235.50  236.00     0    NaN (KEEP)
7  19780203  236.00  236.00  232.25  233.50     0    NaN (DELETE)
8  19780127  237.00  238.50  235.50  236.00     0    NaN (DELETE)
9  19780203  236.00  236.00  232.25  233.50     0    NaN (DELETE)

This is what I currently use to remove the dupes, but it keeps only the first occurrence of each Date and drops every other duplicate (obviously):

df = df.drop_duplicates('Date')

EDIT: I forgot to mention something: the only duplicate I want to keep is when column 'Roll' = 1. If it does, then keep that row and the next row that matches it based on column 'Date'.

antonio_zeus

2 Answers


Using `head` with a `groupby` keeps the first n entries in each group, which I think accomplishes what you want.

In [52]: df.groupby('Date').head(2)
Out[52]: 
       Date   Open   High     Low   Close  Roll
1  19780106  236.0  237.5  234.50  235.50     0
2  19780113  235.5  239.0  235.00  238.25     0
3  19780120  238.0  239.0  234.50  237.00     0
4  19780127  237.0  238.5  235.50  236.00     0
5  19780203  236.0  236.0  232.25  233.50     0
6  19780127  237.0  238.5  235.50  236.00     0
7  19780203  236.0  236.0  232.25  233.50     0

Edit:

In [16]: df['dupe_count'] = df.groupby('Date')['Roll'].transform('max') + 1

In [17]: df.groupby('Date', as_index=False).apply(lambda x: x.head(x['dupe_count'].iloc[0]))
Out[17]: 
         Date   Open   High     Low   Close  Roll  Dupes  dupe_count
0 1  19780106  236.0  237.5  234.50  235.50     0    NaN           1
1 2  19780113  235.5  239.0  235.00  238.25     0    NaN           1
2 3  19780120  238.0  239.0  234.50  237.00     0    NaN           1
3 4  19780127  237.0  238.5  235.50  236.00     1    NaN           2
  6  19780127  237.0  238.5  235.50  236.00     0    NaN           2
4 5  19780203  236.0  236.0  232.25  233.50     0    NaN           1
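The edit above can be reproduced end to end. This sketch rebuilds a frame from the question's Date and Roll columns, and uses `cumcount` in place of the `apply` call to keep the first `dupe_count` rows of each group (same result, no group-wise lambda):

```python
# Self-contained sketch of the edit above, built from the question's
# Date and Roll columns; cumcount replaces the apply() step but keeps
# the first dupe_count rows of each Date group the same way.
import pandas as pd

df = pd.DataFrame({
    'Date': [19780106, 19780113, 19780120, 19780127, 19780203,
             19780127, 19780203, 19780127, 19780203],
    'Roll': [0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Dates whose group contains a Roll == 1 row get dupe_count 2, others 1
df['dupe_count'] = df.groupby('Date')['Roll'].transform('max') + 1

# cumcount numbers rows within each Date group from 0; keep a row only
# while its position is below the group's dupe_count
result = df[df.groupby('Date').cumcount() < df['dupe_count']]
print(result)
```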
chrisb

Assuming Roll can only take the values 0 and 1, if you do

df.groupby(['Date', 'Roll'], as_index=False).first() 

you will get two rows for dates for which one of the rows had Roll = 1 and only one row for dates which have only Roll = 0, which I think is what you want.
I passed `as_index=False` so that the group keys don't end up in the index, as discussed in your comment.
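A minimal sketch of this answer on a frame rebuilt from the question's data (only the Open column is carried along here); Roll is assumed to take only the values 0 and 1, as confirmed in the comments below:

```python
# One row per (Date, Roll) pair: a date with a Roll == 1 row yields two
# rows (one per Roll value), while an all-zero date yields just one.
import pandas as pd

df = pd.DataFrame({
    'Date': [19780106, 19780113, 19780120, 19780127, 19780203,
             19780127, 19780203, 19780127, 19780203],
    'Open': [236.0, 235.5, 238.0, 237.0, 236.0,
             237.0, 236.0, 237.0, 236.0],
    'Roll': [0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# first() takes the earliest row within each (Date, Roll) group;
# as_index=False keeps Date and Roll as ordinary columns
result = df.groupby(['Date', 'Roll'], as_index=False).first()
print(result)
```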

JoeCondron
  • this is interesting but I am wondering if you can either point me in the right direction or explain how groupby() changes my index, which was originally set to the csv file name. I guess what I didn't explain is that I am pulling in numerous csv files to create this DataFrame; the name of the csv is loaded as the row's index, and the reason for the duplicate Dates is the overlapping csv files. – antonio_zeus Sep 14 '15 at 14:56
  • and you are correct, Roll can only take values 0 & 1 – antonio_zeus Sep 14 '15 at 14:57
  • Sure. `groupby` sets the group keys as the index of the result by default. See the edit of my answer for further explanation. – JoeCondron Sep 14 '15 at 17:13
  • hey joe - thank you for this. One more question. Looking at the data in the way it comes out in the dataframe, it appears that the first Date where Roll = 1 is below the second Date where Roll = 0. Is that a sorting feature within groupby? And is there a way to sort it the other way? When I say sort, it's more on how the groupby arranges the two pieces of data as opposed to sorting the entire dataframe. ty – antonio_zeus Sep 14 '15 at 18:10
  • Yes you're right. Again, `groupby` sorts by default but you can disable this by passing `sort=False`. If you want to force the sort with `Roll` descending you can do `result.sort(['Date', 'Roll'], ascending=[True, False])`. This sorts with ascending dates and descending values in `'Roll'` – JoeCondron Sep 14 '15 at 20:02
  • that worked great. I really appreciate the walk-thru on this. very helpful. – antonio_zeus Sep 14 '15 at 21:09
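A note for readers on current pandas: the `DataFrame.sort` call quoted in the comment above was deprecated and later removed; `sort_values` is the modern equivalent. A minimal sketch on a small hypothetical result frame:

```python
# sort_values replaces the removed DataFrame.sort from the comment
# thread; the frame below is a hypothetical result for illustration.
import pandas as pd

result = pd.DataFrame({
    'Date': [19780127, 19780127, 19780203],
    'Roll': [0, 1, 0],
})

# Ascending dates, descending Roll, so the Roll == 1 row comes first
# within its date
sorted_result = result.sort_values(['Date', 'Roll'],
                                   ascending=[True, False])
print(sorted_result)
```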