Python: Removing Rows on Count condition

Question

I have a problem filtering a pandas dataframe.

city 
NYC 
NYC 
NYC 
NYC 
SYD 
SYD 
SEL 
SEL
...

df.city.value_counts()

I would like to remove rows of cities that has less than 4 count frequency, which would be SYD and SEL for instance.

What would be the way to do so without manually dropping them city by city?

what is `dropping them city by city`? maybe you should provide your working code. and what you want. — apple apple, Apr 09 '18 at 14:48
I was thinking it might be possible to do so by using df.drop(['SYD']) for example, but not working. — Devin Lee, Apr 09 '18 at 14:53

BENY · Accepted Answer · 2019-12-01T03:09:36.387

50

Here you go with filter

df.groupby('city').filter(lambda x : len(x)>3)
Out[1743]: 
  city
0  NYC
1  NYC
2  NYC
3  NYC

Solution two transform

sub_df = df[df.groupby('city').city.transform('count')>3].copy() 
# add copy for future warning when you need to modify the sub df

edited Dec 01 '19 at 03:09

answered Apr 09 '18 at 14:57

BENY

317,841
20
164
234

2

This is a fantastic one liner! I really should use `groupby` more, at the moment it's still a form of black magic to me. – Aaron N. Brock Apr 09 '18 at 15:02
2

Nice one. Unfortunately, `lambda` tends to make me sick :(. Only good in small doses! – jpp Apr 09 '18 at 17:34
@jpp yep , for the small sample size , I think filter is more clear , but only for small sample size – BENY Apr 09 '18 at 17:44
Tested on a 1 million rows dataframe, jpp’s `value_counts` solution is slightly faster than the `filter` one, but the `transform` solution is much faster than both (the latter took less than 1 s on my dataset when the others took 5.7 s and 8.3 s, respectively). – Skippy le Grand Gourou Jan 15 '21 at 11:04

jpp · Answer 2 · 2020-09-20T00:21:12.413

29

This is one way using pd.Series.value_counts.

counts = df['city'].value_counts()

res = df[~df['city'].isin(counts[counts < 5].index)]

counts is a pd.Series object. counts < 5 returns a Boolean series. We filter the counts series by the Boolean counts < 5 series (that's what the square brackets achieve). We then take the index of the resultant series to find the cities with < 5 counts. ~ is the negation operator.

Remember a series is a mapping between index and value. The index of a series does not necessarily contain unique values, but this is guaranteed with the output of value_counts.

edited Sep 20 '20 at 00:21

answered Apr 09 '18 at 14:49

jpp

159,742
34
281
339

Thanks a lot! now I got to look into the meaning of '~' – Devin Lee Apr 09 '18 at 15:02
@DevinLee, that just means elementwise "negate" / "negative" in vectorised functions. – jpp Apr 09 '18 at 15:03
i don't undertand this part `counts[counts < 5].index`. Can you elaborate? – Snow Aug 07 '20 at 13:09
1

@Snow, `counts` is a `pd.Series` object. `counts < 5` returns a Boolean series. We filter the `counts` series by the Boolean `counts < 5` series (that's what the square brackets achieve). We then take the index of the resultant series to find the cities with < 5 counts. Remember a series is a mapping between index and value. The index does not necessarily contain unique values, but with `value_counts` this is guaranteed. – jpp Aug 07 '20 at 13:22

Aaron N. Brock · Answer 3 · 2018-04-09T15:04:41.207

I think you're looking for value_counts()

# Import the great and powerful pandas
import pandas as pd

# Create some example data
df = pd.DataFrame({
    'city': ['NYC', 'NYC', 'SYD', 'NYC', 'SEL', 'NYC', 'NYC']
})

# Get the count of each value
value_counts = df['city'].value_counts()

# Select the values where the count is less than 3 (or 5 if you like)
to_remove = value_counts[value_counts <= 3].index

# Keep rows where the city column is not in to_remove
df = df[~df.city.isin(to_remove)]

score 2 · Answer 4 · answered Apr 09 '18 at 14:59

2

Another solution :

threshold=3
df['Count'] = df.groupby('City')['City'].transform(pd.Series.value_counts)
df=df[df['Count']>=threshold]
df.drop(['Count'], axis = 1, inplace = True)
print(df)

  City
0  NYC
1  NYC
2  NYC
3  NYC

answered Apr 09 '18 at 14:59

Sruthi

2,908
1
11
25

Unfortunately this solution is much slower than the others. – Skippy le Grand Gourou Jan 15 '21 at 11:00

Python: Removing Rows on Count condition

4 Answers4

Linked

Related