How to only keep rows in a Pandas DataFrame based on its count in a given column

Question

I have a Pandas DataFrame with some categorical data in one of the columns. On doing value_counts on that particular column, I get something similar to:

HR                          176
Coding                       81
Reject                       74
Database Administration      21
Finance                      17
Project Management           16
Sales                        15
DevOps                       13
Core Electronics             10
Networking                   10
Medical Science               9
Core Mechanical               8
Web Development               4
Puzzles                       3
behavioural                   3
not a question                2
civil engineering             1
Mathematics                   1
Finance, Medical Science      1
Sales, HR                     1

What I'd like to do is keep only the categories with a count >= some threshold (e.g. 10). All the smaller categories should be clubbed into a separate "Other" category, i.e. the result should look like:

HR                          176
Coding                       81
Reject                       74
*Other*                      33
Database Administration      21
Finance                      17
Project Management           16
Sales                        15
DevOps                       13
Core Electronics             10
Networking                   10

I've done this in the past by hacking together a defaultdict(int) and only taking the instances where count >= threshold. I want to know if there is a Pandas canonical way of achieving the same.
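For reference, a minimal sketch of what such a defaultdict-based approach might look like. The function name `lump_rare`, its parameters, and the sample data are hypothetical and only illustrate the idea; this is not the exact code I used.

```python
from collections import defaultdict

def lump_rare(values, threshold=10, other_label="Other"):
    """Count values, then fold every category below `threshold` into `other_label`."""
    counts = defaultdict(int)
    for v in values:
        counts[v] += 1

    lumped = defaultdict(int)
    for category, n in counts.items():
        # Keep frequent categories as-is; add rare ones to the catch-all bucket
        key = category if n >= threshold else other_label
        lumped[key] += n
    return dict(lumped)

# Hypothetical sample data, not the real column
sample = ["HR"] * 12 + ["Coding"] * 10 + ["Puzzles"] * 3 + ["Mathematics"]
result = lump_rare(sample, threshold=10)
print(result)
```

It works, but it bypasses Pandas entirely, which is why I am asking for a more idiomatic route.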

score 1 · Answer 1 · answered Aug 23 '22 at 09:45

I would use a mask to perform boolean indexing and concat:

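import pandas as pd

# s is assumed to be the Series returned by value_counts() on the column
m = s >= 10  # True for categories that meet the threshold
out = (pd.concat([s[m], pd.Series(s[~m].sum(), index=['Others'])])
         .sort_values(ascending=False)
      )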

output:

HR                         176
Coding                      81
Reject                      74
Others                      33
Database Administration     21
Finance                     17
Project Management          16
Sales                       15
DevOps                      13
Core Electronics            10
Networking                  10
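The snippet above works on the counts. If the goal is to relabel the rows of the DataFrame itself, one possible sketch is to map each row to its category count and use `Series.where` to swap rare labels for a catch-all value. The frame `df`, the column names `category` and `category_lumped`, and the threshold of 3 are assumptions made for this small example.

```python
import pandas as pd

# Hypothetical frame; "category" stands in for the real column name
df = pd.DataFrame(
    {"category": ["HR"] * 4 + ["Coding"] * 3 + ["Puzzles"] * 2 + ["Mathematics"]}
)
threshold = 3  # small threshold chosen to suit the toy data

counts = df["category"].value_counts()

# For every row, look up how often its category occurs in the column
frequent = df["category"].map(counts) >= threshold

# Keep the label where the category is frequent enough, else use "Other"
df["category_lumped"] = df["category"].where(frequent, "Other")

lumped_counts = df["category_lumped"].value_counts()
print(lumped_counts)
```

Writing to a new column keeps the original labels available; assigning back to `df["category"]` would overwrite them instead.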

bvittrant · Accepted Answer · 2022-08-23T09:58:49.883

0

Is this the answer you're looking for?

Pandas: Selecting rows based on value counts of a particular column

Otherwise, maybe this is what you want:

import pandas as pd

data = pd.DataFrame([["researcher", 150], ["politician", 15], ["builder", 1], ["teacher", 5]])
data.columns = ["category", "count"]
filter_value = 10
# .copy() avoids SettingWithCopyWarning when adding the "tag" column below
d1 = data[data['count'] >= filter_value].copy()
d2 = data[data['count'] < filter_value].copy()
d1["tag"] = "filter_passed"
d2["tag"] = "Others"
data = pd.concat([d1, d2])
>>> data
     category  count            tag
0  researcher    150  filter_passed
1  politician     15  filter_passed
2     builder      1         Others
3     teacher      5         Others
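As a possible alternative, the same tagging can be sketched in a single vectorised step with `numpy.where`, which avoids splitting the frame and concatenating it again. This reuses the toy data from above and is only an illustration, not a replacement the asker requested.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    [["researcher", 150], ["politician", 15], ["builder", 1], ["teacher", 5]],
    columns=["category", "count"],
)
filter_value = 10

# One pass over the column: no intermediate d1/d2 frames, row order untouched
data["tag"] = np.where(data["count"] >= filter_value, "filter_passed", "Others")
print(data)
```

Because nothing is split or concatenated, the original row order and index are preserved even when rows above and below the threshold are interleaved.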

edited Aug 23 '22 at 09:58

answered Aug 23 '22 at 09:36

bvittrant


This only answers the selection part; however, I also need to change the tag for the rows where the count < `threshold` into a new category altogether – Abirbhav G. Aug 23 '22 at 09:46
Do you want two separate DataFrames in the end? – bvittrant Aug 23 '22 at 09:48
Upon re-reading my question I realise I've worded it wrong. This is the most helpful and accurate answer for the wording used. I'll be opening a new question re-framing the appropriate query. Thanks – Abirbhav G. Aug 23 '22 at 09:56
I edited the code; I think it can be useful for you :) – bvittrant Aug 23 '22 at 09:57
