Pandas. Drop duplicate rows to another dataframe

Question

Here is a data example:

import pandas as pd
df = pd.DataFrame({
    'file': ['file1','file2','file1','file2','file3','file3','file4','file5','file4','file5'],
    'prop1': ['True','False','True','False','False','False','False','True','False','False'],
    'prop2': ['False','False','False','False','True','False','True','False','True','False'],
    'prop3': ['False','True','False','True','False','True','False','False','False','True']
})

    file    prop1   prop2   prop3
0   file1   True    False   False
1   file2   False   False   True
2   file1   True    False   False
3   file2   False   False   True
4   file3   False   True    False
5   file3   False   False   True
6   file4   False   True    False
7   file5   True    False   False
8   file4   False   True    False
9   file5   False   False   True

I need to move the duplicated rows (rows with the same prop values) into another dataframe and remove them from the original one.
The other dataframe should look like this (each duplicated row should appear only once):

    file    prop1   prop2   prop3
0   file1   True    False   False
3   file2   False   False   True
8   file4   False   True    False

`df = df.drop_duplicates()` drops only one copy of each duplicated row and keeps the other, like this:

    file    prop1   prop2   prop3
0   file1   True    False   False
1   file2   False   False   True
4   file3   False   True    False
5   file3   False   False   True
6   file4   False   True    False
7   file5   True    False   False
9   file5   False   False   True

try: `new_df = df.loc[df.duplicated()].copy()` to store duplicated values into a new dataframe — help-ukraine-now, Oct 07 '19 at 14:13
Not sure there's a simple way to get the exact indices you show in your expected output. But would suffice to do `df.drop_duplicates(subset=[f'prop{i}' for i in range(1,4)])` — ALollz, Oct 07 '19 at 14:15
Yes, drop_duplicates works, but I also need to remove the duplicated rows from the dataframe — Contra111, Oct 07 '19 at 14:16

score 1 · Answer 1 · answered Oct 07 '19 at 14:15

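# Keep the first occurrence of every row
uniques = df.drop_duplicates()
# Index labels missing from `uniques` belong to the dropped (duplicate) rows
duplicates = df.loc[df.index.difference(uniques.index)]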

You can use the pandas method drop_duplicates() first to create a dataframe with only the unique rows. Then compare the index of your original dataframe with the index of the frame of unique rows: the labels that were dropped belong to your duplicate rows, which you can select from the original dataframe. You then have the unique rows and the duplicated rows separated.
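The same split can also be sketched with a boolean mask from DataFrame.duplicated, which avoids comparing indices. This is only a sketch under one assumption: a row counts as a duplicate when all of its columns, including file, match an earlier row. The names mask, duplicates, df_clean, moved and remaining are illustrative and not from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['file1', 'file2', 'file1', 'file2', 'file3',
             'file3', 'file4', 'file5', 'file4', 'file5'],
    'prop1': ['True', 'False', 'True', 'False', 'False',
              'False', 'False', 'True', 'False', 'False'],
    'prop2': ['False', 'False', 'False', 'False', 'True',
              'False', 'True', 'False', 'True', 'False'],
    'prop3': ['False', 'True', 'False', 'True', 'False',
              'True', 'False', 'False', 'False', 'True'],
})

# True for every repeat of a row already seen; first occurrences stay False.
mask = df.duplicated()

duplicates = df[mask].copy()   # the repeated rows (index 2, 3, 8)
df_clean = df[~mask].copy()    # the original without the repeats

# Variant: keep=False flags every copy of a repeated row, so the original
# loses all copies and the other frame keeps one representative of each.
all_copies = df.duplicated(keep=False)
moved = df[all_copies].drop_duplicates()   # file1, file2, file4
remaining = df[~all_copies]                # index 4, 5, 7, 9

print(moved)
print(remaining)
```

The variant returns the same files as the expected output in the question (file1, file2, file4), although the index labels it keeps (0, 1, 6) differ from the ones shown there.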

score 1 · Answer 2 · answered Oct 07 '19 at 14:16

Use DataFrame.drop_duplicates and specify the column names, for example by selecting all columns except the first:

df = df.drop_duplicates(df.columns[1:])

Or select the columns with prop in their names:

df = df.drop_duplicates(df.filter(like='prop').columns)

print (df)
    file  prop1  prop2  prop3
0  file1   True  False  False
1  file2  False  False   True
4  file3  False   True  False
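Both calls above overwrite df with the de-duplicated frame, so the removed rows are lost. If they should go to another dataframe, one possible sketch is to build the mask with DataFrame.duplicated on the same column subset and split on it. It assumes that "duplicate" means equal values in the prop columns only; the names prop_cols, kept and dropped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['file1', 'file2', 'file1', 'file2', 'file3',
             'file3', 'file4', 'file5', 'file4', 'file5'],
    'prop1': ['True', 'False', 'True', 'False', 'False',
              'False', 'False', 'True', 'False', 'False'],
    'prop2': ['False', 'False', 'False', 'False', 'True',
              'False', 'True', 'False', 'True', 'False'],
    'prop3': ['False', 'True', 'False', 'True', 'False',
              'True', 'False', 'False', 'False', 'True'],
})

# Compare rows on the prop columns only, ignoring the file column.
prop_cols = df.filter(like='prop').columns.tolist()

# True for rows whose prop values already appeared in an earlier row.
mask = df.duplicated(subset=prop_cols)

kept = df[~mask]     # same rows as drop_duplicates(prop_cols): index 0, 1, 4
dropped = df[mask]   # every row that drop_duplicates would discard

print(kept)
print(dropped)
```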
