Pandas how update values with counts greater x

Question

I have a pandas column that contains a lot of string that appear less than 5 times, I do not to remove these values however I do want to replace them with a placeholder string called "pruned". What is the best way to do this?

df= pd.DataFrame(['a','a','b','c'],columns=["x"])
# get value counts and set pruned I want something that does as follows
df[df[count<2]] = "pruned"

Possible duplicate of [Pandas DataFrame: replace all values in a column, based on condition](https://stackoverflow.com/questions/31511997/pandas-dataframe-replace-all-values-in-a-column-based-on-condition) — mr.tarsa, Jan 18 '18 at 17:18
Please read [how to make good, reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). — cmaher, Jan 18 '18 at 17:19
@tarashypka I don't think this is exactly a dup of https://stackoverflow.com/questions/31511997/pandas-dataframe-replace-all-values-in-a-column-based-on-condition, since it asks for a condition that requires counting over the `df`. Of course this would have been clearer with a minimal reproducible example. — muskrat, Jan 18 '18 at 17:56

muskrat · Accepted Answer · 2018-01-18T22:53:32.223

I suspect there is a more efficient way to do this, but simple way to do it is to build a dict of counts and then prune if those values are below the count threshold. Consider the example df:

df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])

    foo
0   12
1   11
2   4
3   15
4   6
5   12
6   4
7   7

# make a dict with counts
count_dict = {d:(df['foo']==d).sum() for d in df.foo.unique()}
# assign that dict to a column
df['bar'] = [count_dict[d] for d in df.foo]
# loc in the 'pruned' tag
df.loc[df.bar < 2, 'foo']='pruned'

Returns as desired:

    foo bar
0   12      2
1   pruned  1
2   4       2
3   pruned  1
4   pruned  1
5   12      2
6   4       2
7   pruned  1

(and of course you would change 2 to 5 and dump that bar column if you want).

UPDATE

Per request for an in-place version, here is a one-liner that can do it without assigning another column or creating that dict directly (and thanks @TrigonaMinima for the values_count() tip):

df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
print(df)
df.foo = df.foo.apply(lambda row: 'pruned' if (df.foo.value_counts() < 2)[row] else row)
print(df)

which returns again as desired:

```value_counts``` can give you frequency counts directly. Then you can carry on the filtering. — TrigonaMinima, Jan 18 '18 at 18:06
my only issue with this solution is that it requires that i add an additional column to my original dataframe. Is there anyway to filter in place? — Ari, Jan 18 '18 at 21:05
@Ari see my update above; this updated one-liner achieves what you are looking for without an additional column. Please accept answer if useful. — muskrat, Jan 18 '18 at 22:54

score 0 · Answer 2 · answered Jan 18 '18 at 21:07

This is the solution I ended up using based on the answer above.

import pandas as pd
df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
# make a dict with counts
count_dict = dict(df.foo.value_counts())
# assign that dict to a column
df['temp_count'] = [count_dict[d] for d in df.foo]
# loc in the 'pruned' tag
df.loc[df.temp_count < 2, 'foo']='pruned'
df = df.drop(["temp_count"], axis=1)

Pandas how update values with counts greater x

2 Answers2