-1

I have a pandas column that contains a lot of string that appear less than 5 times, I do not to remove these values however I do want to replace them with a placeholder string called "pruned". What is the best way to do this?

df= pd.DataFrame(['a','a','b','c'],columns=["x"])
# get value counts and set pruned I want something that does as follows
df[df[count<2]] = "pruned"
Ari
  • 563
  • 2
  • 17
  • Possible duplicate of [Pandas DataFrame: replace all values in a column, based on condition](https://stackoverflow.com/questions/31511997/pandas-dataframe-replace-all-values-in-a-column-based-on-condition) – mr.tarsa Jan 18 '18 at 17:18
  • Please read [how to make good, reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – cmaher Jan 18 '18 at 17:19
  • 1
    @tarashypka I don't think this is exactly a dup of https://stackoverflow.com/questions/31511997/pandas-dataframe-replace-all-values-in-a-column-based-on-condition, since it asks for a condition that requires counting over the `df`. Of course this would have been clearer with a minimal reproducible example. – muskrat Jan 18 '18 at 17:56

2 Answers2

1

I suspect there is a more efficient way to do this, but simple way to do it is to build a dict of counts and then prune if those values are below the count threshold. Consider the example df:

df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])

    foo
0   12
1   11
2   4
3   15
4   6
5   12
6   4
7   7

# make a dict with counts
count_dict = {d:(df['foo']==d).sum() for d in df.foo.unique()}
# assign that dict to a column
df['bar'] = [count_dict[d] for d in df.foo]
# loc in the 'pruned' tag
df.loc[df.bar < 2, 'foo']='pruned'

Returns as desired:

    foo bar
0   12      2
1   pruned  1
2   4       2
3   pruned  1
4   pruned  1
5   12      2
6   4       2
7   pruned  1

(and of course you would change 2 to 5 and dump that bar column if you want).

UPDATE

Per request for an in-place version, here is a one-liner that can do it without assigning another column or creating that dict directly (and thanks @TrigonaMinima for the values_count() tip):

df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
print(df)
df.foo = df.foo.apply(lambda row: 'pruned' if (df.foo.value_counts() < 2)[row] else row)
print(df)

which returns again as desired:

   foo
0   12
1   11
2    4
3   15
4    6
5   12
6    4
7    7
      foo
0      12
1  pruned
2       4
3  pruned
4  pruned
5      12
6       4
7  pruned
muskrat
  • 1,519
  • 11
  • 18
  • 1
    ```value_counts``` can give you frequency counts directly. Then you can carry on the filtering. – TrigonaMinima Jan 18 '18 at 18:06
  • my only issue with this solution is that it requires that i add an additional column to my original dataframe. Is there anyway to filter in place? – Ari Jan 18 '18 at 21:05
  • @Ari see my update above; this updated one-liner achieves what you are looking for without an additional column. Please accept answer if useful. – muskrat Jan 18 '18 at 22:54
0

This is the solution I ended up using based on the answer above.

import pandas as pd
df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
# make a dict with counts
count_dict = dict(df.foo.value_counts())
# assign that dict to a column
df['temp_count'] = [count_dict[d] for d in df.foo]
# loc in the 'pruned' tag
df.loc[df.temp_count < 2, 'foo']='pruned'
df = df.drop(["temp_count"], axis=1)
Ari
  • 563
  • 2
  • 17