5

I have a dataframe with lots of rows. Sometimes the values are one-offs and not very useful for my purpose.

How can I remove all rows where the value in column 2 or column 3 doesn't appear more than 5 times?

df input

 Col1     Col2     Col3       Col4
 1        apple    tomato     banana
 1        apple    potato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        grape    tomato     banana
 1        pear     tomato     banana
 1        lemon    tomato     banana

output

 Col1     Col2     Col3       Col4
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
aiden rosenblatt
  • Check this very useful answer: https://stackoverflow.com/questions/19384532/how-to-count-number-of-rows-in-a-group-in-pandas-group-by-object – Denis Rasulev Jan 30 '18 at 04:43
  • Are you checking based on global counts, or column wise counts? Does it matter even? Is it possible that `apple` also appears in Col3, besides Col2? – cs95 Jan 30 '18 at 05:35
  • @coldspeed column wise count. Counts are restricted to the column, so it doesn't matter if apple appears in a different column; it should be based on Col2's count – aiden rosenblatt Jan 30 '18 at 06:05
  • What if it is `potato` in the first value of `Col3`? What should be the desired output? – Tai Jan 30 '18 at 06:28

6 Answers

6

Global Counts
Use stack + value_counts + replace -

v = df[['Col2', 'Col3']]
df[v.replace(v.stack().value_counts()).gt(5).all(1)]

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana

(Update)
Columnwise Counts

Call apply with pd.Series.value_counts on your columns of interest, and filter in the same manner as before -

v = df[['Col2', 'Col3']]
df[v.replace(v.apply(pd.Series.value_counts)).gt(5).all(1)]

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana

Details
Use value_counts to count values in your dataframe -

c = v.apply(pd.Series.value_counts)
c

        Col2  Col3
apple    6.0   NaN
grape    1.0   NaN
lemon    1.0   NaN
pear     1.0   NaN
potato   NaN   1.0
tomato   NaN   8.0

Call replace to replace values in the DataFrame with their counts -

i = v.replace(c)
i

   Col2  Col3
0     6     8
1     6     1
2     6     8
3     6     8
4     6     8
5     6     8
6     1     8
7     1     8
8     1     8

From that point, build the mask -

m = i.gt(5).all(1)

0     True
1    False
2     True
3     True
4     True
5     True
6    False
7    False
8    False
dtype: bool

Use the mask to index df.
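
Indexing df with m reproduces the output shown above -

df[m]

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana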

cs95
  • This will also filter on Col4, which we don't want. It also counts all the values globally, not per column, so we can have problems with the same values appearing in different columns. – Alexey Trofimov Jan 30 '18 at 05:20
  • Just change one "tomato" to "apple" in Col3 and there is a problem. Here we need column-wise counting. – Alexey Trofimov Jan 30 '18 at 05:27
  • 1
    @AlexeyTrofimov That's your interpretation. My interpretation is that the counts are global, based on my understanding of the question. We'll wait for OP to clarify before we discuss any further ;) – cs95 Jan 30 '18 at 05:29
  • counts are based on each row @AlexeyTrofimov I am trying to clean low frequency values in this dataset – aiden rosenblatt Jan 30 '18 at 05:40
  • @aidenrosenblatt Each _row_? Okay, which of these answers have worked for you thus far? There are multiple interpretations to your question, so I'm trying to understand which one it is you need. – cs95 Jan 30 '18 at 05:44
  • I also don't get the _each row_ part :) – Alexey Trofimov Jan 30 '18 at 06:01
  • this solution looks at global counts, I need column counts – aiden rosenblatt Jan 30 '18 at 06:08
5

Easy way with transform

counts_col2 = df.groupby("Col2")["Col2"].transform(len)
counts_col3 = df.groupby("Col3")["Col3"].transform(len)

mask = (counts_col2 > 5) & (counts_col3 > 5)

df[mask]

output:

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana
Alexey Trofimov
1

To create the example data frame

import pandas as pd

text = '''Col1     Col2     Col3       Col4
 1        apple    tomato     banana
 1        apple    potato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        apple    tomato     banana
 1        grape    tomato     banana
 1        pear     tomato     banana
 1        lemon    tomato     banana'''

lines = text.split('\n')
headers = lines[0].split()
data = [line.split() for line in lines[1:]]

df = pd.DataFrame(data=data, columns=headers)

The value_counts method produces a dict-like object with unique column values as the keys and a count as the value. It is these keys I am assigning to k.

  • value_counts returns a Pandas series object but it is like a dict

This list comprehension has a filtering 'if' statement that ignores keys whose associated count isn't > 5.

In this example it returns a list with only one value, but in other cases it could be more.

col2_counts = df['Col2'].value_counts()
Col2_more_than_5 = [k for k in col2_counts.keys()
                    if col2_counts[k] > 5]

col3_counts = df['Col3'].value_counts()
Col3_more_than_5 = [k for k in col3_counts.keys()
                    if col3_counts[k] > 5]

I now have two lists containing the string(s) that occur more than 5 times in each column. Now I create a selector that returns rows where both conditions are true -

df[(df['Col2'].isin(Col2_more_than_5)) & (df['Col3'].isin(Col3_more_than_5))]

The isin method also works when there is more than one value in the list.

Output:

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana
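
The same idea can be written more compactly by mapping each column's counts back onto its values - a sketch equivalent to the list-based version above:

c2 = df['Col2'].value_counts()
c3 = df['Col3'].value_counts()

# keep rows where both columnwise counts exceed 5
df[df['Col2'].map(c2).gt(5) & df['Col3'].map(c3).gt(5)]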

memebrain
1
# concatenate each row's values into a single string
v = df.astype(str).sum(1)
# keep rows matching the (first) concatenation that occurs at least 5 times
df[v.eq(v.value_counts()[v.value_counts() >= 5].index.values[0])]
Out[145]: 
   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana
BENY
  • Hmm... calling `value_counts` twice seems wasteful. Maybe cache the intermediate result?? – cs95 Jan 30 '18 at 05:43
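
A cached variant along the lines of that comment - a sketch, using map so that more than one qualifying combination is also kept:

v = df.astype(str).sum(1)   # row-wise concatenation, as above
counts = v.value_counts()   # computed once and reused
df[v.map(counts).ge(5)]     # keep rows whose combination occurs at least 5 times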
1

Fastest way by @ALollz

def agg_size_nosort(df):
    counts_col2 = df.groupby("Col2", sort=False)["Col2"].transform('size')
    counts_col3 = df.groupby("Col3", sort=False)["Col3"].transform('size')
    mask = (counts_col2 > 5) & (counts_col3 > 5)
    return df[mask]
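
Called on the example frame, this returns the same five rows as the other columnwise answers -

agg_size_nosort(df)

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana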
Zesty Dragon
0

One can also use filter twice.

df.groupby("Col2").filter(lambda x: len(x) >= 5) \
  .groupby("Col3").filter(lambda x: len(x) >= 5)

The documentation of filter says

Return a copy of a DataFrame excluding elements from groups that do not satisfy the boolean criterion specified by func.
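
Note that the second filter counts group sizes within the result of the first, not in the original frame: after the first filter only 5 `tomato` rows remain, so a strict `> 5` in the second filter would return an empty frame here, while `>= 5` reproduces the same five rows as the other answers -

   Col1   Col2    Col3    Col4
0     1  apple  tomato  banana
2     1  apple  tomato  banana
3     1  apple  tomato  banana
4     1  apple  tomato  banana
5     1  apple  tomato  banana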

Tai