2

I have a dataframe, one of the columns is categorical in nature and many of those values in this column are repeating, however there are many values that have very small count i.e. one digit count whereas other values have count in 3 or 4 digits. I want to replace all the values in this categorical column that have value count of less than 10 with value 'other'. Below, I am trying to mimic my question here with an example dataframe.

Code for example dataframe is as follows :

l1,l2 = [1,2,3,4,5, 6 ,7, 8,9,10], ['aa', 'bb', 'aa', 'bb', 'bb', 'ee', 'bb','gg','gg','gg']
dataframe = pd.DataFrame(zip(l1,l2), columns =['l1','l2'])
dataframe.l2.value_counts()
l1 l2
1 'aa'
2 'bb'
3 'aa'
4 'bb'
5 'bb'
6 'ee'
7 'bb'
8 'gg'
9 'gg'
10 'gg'

Now if I print value_counts() for column 'l2' I will get count of every value in column 'l2'.

dataframe.l2.value_counts()

Output is : Value counts

My question is, how to replace all those values in this 'l2' column which have value count <3 with value 'other' My expected dataframe is :

l1 l2
1 'other'
2 'bb'
3 'other'
4 'bb'
5 'bb'
6 'other'
7 'bb'
8 'gg'
9 'gg'
10 'gg'

Here as you can see, all instances of values 'aa' and 'ee' are replaced with 'other' as their value count was less than 3. How to do this ?

2 Answers2

2

Here is a simple solution using value_counts, to_dict and replace DataFrame methods:

# creating sample dataframe
import pandas as pd

df = pd.DataFrame({"a": [0,1,2,2,3,3,4,4], "b": [0,5,2,2,3,3,4,5]})

Here is the result you want:

# now replacing all column values with value_counts() result from column b:
df.replace(df.b.value_counts().to_dict())

Details:

Value counts from b is converted to a dictionnary to be used as input in the replace() method. It look like this:

df.b.value_counts().to_dict()

{5: 2, 3: 2, 2: 2, 4: 1, 0: 1}

The final result looks like this:

enter image description here

Documentation:

fpajot
  • 698
  • 3
  • 14
1

Use Series.map with Series.value_counts for repeat counts per groups and then set values by Series.mask or DataFrame.loc:

mask = dataframe.l2.map(dataframe.l2.value_counts()) < 3
#alternative
#mask = dataframe.groupby('l2').l2.transform('size') < 3
dataframe.l2 =  dataframe.l2.mask(mask, 'other')
#alternative
#dataframe.loc[mask, 'l2'] = 'other'

print (dataframe)
   l1     l2
0   1  other
1   2     bb
2   3  other
3   4     bb
4   5     bb
5   6  other
6   7     bb
7   8     gg
8   9     gg
9  10     gg
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252