5

I have a dataframe with lots of rows. Sometimes the values are one-offs and not very useful for my purpose.

How can I remove all rows where the value in column 2 or column 3 doesn't appear more than 5 times?

df input

 Col1     Col2     Col3       Col4
 1        apple    tomato     banana
 1        apple    potato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        grape    tomato     banana
 1        pear     tomato     banana
 1        lemon    tomato     banana

output

 Col1     Col2     Col3       Col4
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
aiden rosenblatt
  • Check this very useful answer: https://stackoverflow.com/questions/19384532/how-to-count-number-of-rows-in-a-group-in-pandas-group-by-object – Denis Rasulev Jan 30 '18 at 04:43
  • Are you checking based on global counts, or column wise counts? Does it matter even? Is it possible that `apple` also appears in Col3, besides Col2? – cs95 Jan 30 '18 at 05:35
  • @coldspeed column wise count. Counts are restricted to the column, so it doesn't matter if apple appears in a different column; it should be based on Col2's count – aiden rosenblatt Jan 30 '18 at 06:05
  • What if it is `potato` in the first value of `Col3`? What should be the desired output? – Tai Jan 30 '18 at 06:28

6 Answers

6

Global Counts
Use stack + value_counts + replace -

v = df[['Col2', 'Col3']]
df[v.replace(v.stack().value_counts()).gt(5).all(1)]

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana

(Update)
Columnwise Counts

Call apply with pd.Series.value_counts on your columns of interest, and filter in the same manner as before -

v = df[['Col2', 'Col3']]
df[v.replace(v.apply(pd.Series.value_counts)).gt(5).all(1)]

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana

Details
Use value_counts to count values in your dataframe -

c = v.apply(pd.Series.value_counts)
c

        Col2  Col3
apple    6.0   NaN
grape    1.0   NaN
lemon    1.0   NaN
pear     1.0   NaN
potato   NaN   1.0
tomato   NaN   8.0

Call replace to replace values in the DataFrame with their counts -

i = v.replace(c)
i

   Col2  Col3
0     6     8
1     6     1
2     6     8
3     6     8
4     6     8
5     6     8
6     1     8
7     1     8
8     1     8

From that point, build the mask -

m = i.gt(5).all(1)

0     True
1    False
2     True
3     True
4     True
5     True
6    False
7    False
8    False
dtype: bool

Use the mask to index df.
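
Indexing df with m reproduces the output shown above -

df[m]

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana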

cs95
  • This will also filter on Col4, which we don't want. It also counts all the values globally, not per column, so we can have problems with the same values appearing in different columns. – Alexey Trofimov Jan 30 '18 at 05:20
  • Just change one "tomato" to "apple" in Col3 and there is a problem. Here we need column-wise counting. – Alexey Trofimov Jan 30 '18 at 05:27
  • 1
    @AlexeyTrofimov That's your interpretation. My interpretation is that the counts are global, based on my understanding of the question. We'll wait for OP to clarify before we discuss any further ;) – cs95 Jan 30 '18 at 05:29
  • counts are based on each row @AlexeyTrofimov I am trying to clean low frequency values in this dataset – aiden rosenblatt Jan 30 '18 at 05:40
  • @aidenrosenblatt Each _row_? Okay, which of these answers have worked for you thus far? There are multiple interpretations to your question, so I'm trying to understand which one it is you need. – cs95 Jan 30 '18 at 05:44
  • I also don't get the _each row_ part :) – Alexey Trofimov Jan 30 '18 at 06:01
  • this solution looks at global counts, I need column counts – aiden rosenblatt Jan 30 '18 at 06:08
5

Easy way with transform

counts_col2 = df.groupby("Col2")["Col2"].transform(len)
counts_col3 = df.groupby("Col3")["Col3"].transform(len)

mask = (counts_col2 > 5) & (counts_col3 > 5)

df[mask]

output:

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana
Alexey Trofimov
1

To create the example data frame

import pandas as pd

text = '''Col1     Col2     Col3       Col4
 1        apple    tomato     banana
 1        apple    potato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        grape    tomato     banana
 1        pear     tomato     banana
 1        lemon    tomato     banana'''

lines = text.split('\n')
headers = lines[0].split()
data = [line.split() for line in lines[1:]]

df = pd.DataFrame(data=data, columns=headers)

The value_counts method produces a dict-like object with unique column values as the keys and a count as the value. It is these keys I am assigning to k.

  • value_counts returns a Pandas series object but it is like a dict

This list comprehension has a filtering 'if' statement that ignores keys whose associated count isn't > 5.

In this example it returns a list with only one value, but in other cases it could be more.

col2_counts = df['Col2'].value_counts()
Col2_more_than_5 = [k for k in col2_counts.keys()
                    if col2_counts[k] > 5]

col3_counts = df['Col3'].value_counts()
Col3_more_than_5 = [k for k in col3_counts.keys()
                    if col3_counts[k] > 5]

I now have two lists containing the string(s) that occur more than 5 times in each column. Now I create a selector that returns rows where both conditions are true -

df[(df['Col2'].isin(Col2_more_than_5)) & (df['Col3'].isin(Col3_more_than_5))]

The isin method also works when there is more than one value in the list.

Output:

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana
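
The same idea can be written more compactly by mapping each column's counts back onto its values - a sketch equivalent to the list-based version above:

c2 = df['Col2'].value_counts()
c3 = df['Col3'].value_counts()

# keep rows where both columnwise counts exceed 5
df[df['Col2'].map(c2).gt(5) & df['Col3'].map(c3).gt(5)]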

memebrain
1
# concatenate each row's values into a single string
v = df.astype(str).sum(1)
# keep rows matching the (first) concatenation that occurs at least 5 times
df[v.eq(v.value_counts()[v.value_counts() >= 5].index.values[0])]
Out[145]: 
   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana
BENY
  • Hmm... calling `value_counts` twice seems wasteful. Maybe cache the intermediate result?? – cs95 Jan 30 '18 at 05:43
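
A cached variant along the lines of that comment - a sketch, using map so that more than one qualifying combination is also kept:

v = df.astype(str).sum(1)   # row-wise concatenation, as above
counts = v.value_counts()   # computed once and reused
df[v.map(counts).ge(5)]     # keep rows whose combination occurs at least 5 times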
1

Fastest way by @ALollz

def agg_size_nosort(df):
    counts_col2 = df.groupby("Col2", sort=False)["Col2"].transform('size')
    counts_col3 = df.groupby("Col3", sort=False)["Col3"].transform('size')
    mask = (counts_col2 > 5) & (counts_col3 > 5)
    return df[mask]
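
Called on the example frame, this returns the same five rows as the other columnwise answers -

agg_size_nosort(df)

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana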
Zesty Dragon
0

One can also use filter twice.

df.groupby("Col2").filter(lambda x: len(x) >= 5) \
  .groupby("Col3").filter(lambda x: len(x) >= 5)

The documentation of filter says

Return a copy of a DataFrame excluding elements from groups that do not satisfy the boolean criterion specified by func.
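
Note that the second filter counts group sizes within the result of the first, not in the original frame: after the first filter only 5 `tomato` rows remain, so a strict `> 5` in the second filter would return an empty frame here, while `>= 5` reproduces the same five rows as the other answers -

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana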

Tai