Here is an example:

import pandas as pd

df = pd.DataFrame({
    'file': ['file1','file1','file1','file1','file2','file3','file4','file4','file4','file4'],
    'text': ['Text1','Text2','Text3','Text4','Text5','Text6','Text7','Text8','Text9','Text10'],
})

I need to remove the rows whose 'file' value occurs exactly 4 times, so in this example I need to remove the rows where file is file1 or file4.

jezrael
Contra111
  • Are you looking for this: https://stackoverflow.com/questions/29836836/how-do-i-filter-a-pandas-dataframe-based-on-value-counts ? – skott Oct 24 '19 at 07:45

2 Answers

Use GroupBy.transform to get the number of rows in each group, then filter with boolean indexing:

df1 = df[df.groupby('file')['file'].transform('size') != 4]

Explanation: transform needs a column selected after groupby. With size, the choice of column does not matter, because size counts the rows in each group. The result is a new Series with the same length and index as the original DataFrame, filled with the count of each row's group:

print (df.groupby('file')['file'].transform('size'))
0    4
1    4
2    4
3    4
4    1
5    1
6    4
7    4
8    4
9    4
Name: file, dtype: int64
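
To illustrate why the selected column does not matter for size, here is a minimal sketch on a made-up three-row frame (not the question's data) that contains a missing value. The variable names are only illustrative. It contrasts 'size', which counts all rows in a group, with 'count', which skips missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: file1 has one missing 'text' value.
df = pd.DataFrame({
    'file': ['file1', 'file1', 'file2'],
    'text': ['Text1', np.nan, 'Text3'],
})

# 'size' counts rows per group, so any selected column gives the same numbers.
size_via_file = df.groupby('file')['file'].transform('size')
size_via_text = df.groupby('file')['text'].transform('size')

# 'count' counts only non-missing values in the selected column.
count_via_text = df.groupby('file')['text'].transform('count')

print(size_via_file.tolist())
print(size_via_text.tolist())
print(count_via_text.tolist())
```

If the column used for counting can contain missing values, 'size' is likely the safer choice for a filter on the number of rows.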

Or use DataFrameGroupBy.filter, which should be slower on large data:

df1 = df.groupby('file').filter(lambda x: len(x) != 4)

Or use Series.map with Series.value_counts:

df1 = df[df['file'].map(df['file'].value_counts()) != 4]

print (df1)
    file   text
4  file2  Text5
5  file3  Text6
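
As a quick sanity check, the three approaches above can be run side by side on the question's data. This is only a sketch. The names by_transform, by_filter and by_map are introduced here for illustration and are not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['file1', 'file1', 'file1', 'file1', 'file2',
             'file3', 'file4', 'file4', 'file4', 'file4'],
    'text': ['Text1', 'Text2', 'Text3', 'Text4', 'Text5',
             'Text6', 'Text7', 'Text8', 'Text9', 'Text10'],
})

# 1) Per-row group size, then a boolean mask.
by_transform = df[df.groupby('file')['file'].transform('size') != 4]

# 2) Keep or drop whole groups with a Python callback.
by_filter = df.groupby('file').filter(lambda x: len(x) != 4)

# 3) Look up each row's count in the value_counts Series.
by_map = df[df['file'].map(df['file'].value_counts()) != 4]

# Check whether all three give the same rows.
print(by_transform.equals(by_filter) and by_transform.equals(by_map))
```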
jezrael

Using GroupBy with transform:

df[df.groupby('file').text.transform('size').ne(4)]

    file   text
4  file2  Text5
5  file3  Text6
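
If the count to exclude varies, the same idea could be wrapped in a small helper. This is a sketch only. The function drop_groups_of_size and its parameters are hypothetical names, not pandas API:

```python
import pandas as pd


def drop_groups_of_size(df, column, n):
    """Return the rows whose value in `column` does not occur exactly n times.

    Hypothetical helper for illustration; it wraps the transform('size') mask.
    """
    sizes = df.groupby(column)[column].transform('size')
    return df[sizes.ne(n)]


df = pd.DataFrame({
    'file': ['file1', 'file1', 'file1', 'file1', 'file2',
             'file3', 'file4', 'file4', 'file4', 'file4'],
    'text': ['Text1', 'Text2', 'Text3', 'Text4', 'Text5',
             'Text6', 'Text7', 'Text8', 'Text9', 'Text10'],
})

result = drop_groups_of_size(df, 'file', 4)
print(result)
```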
yatu