Pandas: remove rows whose column value is duplicated N times

Question

Here is an example:

import pandas as pd

df = pd.DataFrame({
    'file': ['file1','file1','file1','file1','file2','file3','file4','file4','file4','file4'],
    'text': ['Text1','Text2','Text3','Text4','Text5','Text6','Text7','Text8','Text9','Text10'],
})

I need to remove the rows whose 'file' value is repeated 4 times. In this example, that means removing the rows where file is file1 or file4.

Are you looking for this: https://stackoverflow.com/questions/29836836/how-do-i-filter-a-pandas-dataframe-based-on-value-counts ? — skott, Oct 24 '19 at 07:45

jezrael · Accepted Answer · 2019-10-24T07:55:09.547

Use GroupBy.transform to get the number of rows in each group, then filter with boolean indexing:

df1 = df[df.groupby('file')['file'].transform('size') != 4]

Explanation: transform needs a column selected after groupby to operate on. With 'size' the result is the same whichever column of the DataFrame you select, and it returns a new column (Series) with the same length as the original DataFrame, filled with the group counts:

print (df.groupby('file')['file'].transform('size'))
0    4
1    4
2    4
3    4
4    1
5    1
6    4
7    4
8    4
9    4
Name: file, dtype: int64
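As a quick check of the explanation above, here is a minimal sketch that rebuilds the question's df and compares the counts obtained from two different columns. The variable names by_file, by_text, same_index and same_counts are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['file1', 'file1', 'file1', 'file1', 'file2',
             'file3', 'file4', 'file4', 'file4', 'file4'],
    'text': ['Text1', 'Text2', 'Text3', 'Text4', 'Text5',
             'Text6', 'Text7', 'Text8', 'Text9', 'Text10'],
})

# Group sizes broadcast back to every row of the original frame.
by_file = df.groupby('file')['file'].transform('size')
by_text = df.groupby('file')['text'].transform('size')

# The result shares df's index, so it can be used directly as a row mask.
same_index = by_file.index.equals(df.index)

# 'size' counts rows per group, so the selected column does not change the values.
same_counts = by_file.tolist() == by_text.tolist()

print(same_index, same_counts)
```

Because the result is aligned with df row by row, comparing it to 4 gives a boolean mask that can go straight into df[...].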

Or use DataFrameGroupBy.filter, although it is likely to be slower on large data:

df1 = df.groupby('file').filter(lambda x: len(x) != 4)

Or Series.map with Series.value_counts:

df1 = df[df['file'].map(df['file'].value_counts()) != 4]

print (df1)
    file   text
4  file2  Text5
5  file3  Text6
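The three variants can be checked against each other, and the count can be made a parameter to match the "N times" in the title. The sketch below assumes the question's df; drop_groups_of_size is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def drop_groups_of_size(df, column, n):
    """Hypothetical helper: drop rows whose `column` value occurs exactly n times."""
    # Map each row's value to how often that value occurs in the column.
    counts = df[column].map(df[column].value_counts())
    return df[counts != n]


df = pd.DataFrame({
    'file': ['file1', 'file1', 'file1', 'file1', 'file2',
             'file3', 'file4', 'file4', 'file4', 'file4'],
    'text': ['Text1', 'Text2', 'Text3', 'Text4', 'Text5',
             'Text6', 'Text7', 'Text8', 'Text9', 'Text10'],
})

via_transform = df[df.groupby('file')['file'].transform('size') != 4]
via_filter = df.groupby('file').filter(lambda x: len(x) != 4)
via_map = drop_groups_of_size(df, 'file', 4)

# All three keep the same rows, with the original index labels preserved.
print(via_map)
```

On this data all three approaches keep only the file2 and file3 rows, so the choice between them is mostly about readability and speed.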

Can you explain the first variant a little, please? How does it work, especially ('file')['file']? — Contra111, Oct 24 '19 at 07:51

yatu · Answer 2 · 2019-10-24T07:43:03.743


Using GroupBy with transform:

df[df.groupby('file').text.transform('size').ne(4)]

   file   text
4  file2  Text5
5  file3  Text6
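For comparison with the accepted answer, here is a small sketch showing that .ne(4) builds the same mask as the != operator, and that ['text'] selects the same column as the .text attribute. It assumes the question's df; the names sizes, mask_method and mask_operator are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['file1', 'file1', 'file1', 'file1', 'file2',
             'file3', 'file4', 'file4', 'file4', 'file4'],
    'text': ['Text1', 'Text2', 'Text3', 'Text4', 'Text5',
             'Text6', 'Text7', 'Text8', 'Text9', 'Text10'],
})

# Bracket access is equivalent to the .text attribute used in the answer.
sizes = df.groupby('file')['text'].transform('size')

mask_method = sizes.ne(4)    # method form, convenient when chaining calls
mask_operator = sizes != 4   # operator form, as in the accepted answer

print(mask_method.equals(mask_operator))
print(df[mask_method])
```

The method form reads a little more naturally at the end of a chained expression; otherwise the two are interchangeable here.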

edited Oct 24 '19 at 07:43

answered Oct 24 '19 at 07:41
