Replace low frequency categorical values from pandas.dataframe while ignoring NaNs

Question

How can I replace the values from certain columns in a pandas.DataFrame that occur rarely, i.e. with low frequency (while ignoring NaNs)?

For example, in the following dataframe, suppose I wanted to replace any values in columns A or B that occur less than three times in their respective column. I want to replace these with "other":

import pandas as pd
import numpy as np

df = pd.DataFrame({'A':['ant','ant','cherry', pd.np.nan, 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog',pd.np.nan, 'emu', 'emu']})
df
   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | peach | dog |
cherry | cat   | NaN |
NaN    | cat   | emu |
ant    | peach | emu |

In other words, in columns A and B, I want to replace those values that occur twice or less (but leave NaNs alone).

So the output I want is:

   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | other | dog |
other  | cat   | NaN |
NaN    | cat   | emu |
ant    | other | emu |

This is related to a previously posted question: Remove low frequency values from pandas.dataframe

but the solution there resulted in an "AttributeError: 'NoneType' object has no attribute 'any.'" (I think because I have NaN values?)

score 10 · Accepted Answer · edited May 23 '17 at 12:01

This is pretty similar to Change values in pandas dataframe according to value_counts(). You can add a condition to the lambda function to exclude column 'C' as follows:

df.apply(lambda x: x.mask(x.map(x.value_counts())<3, 'other') if x.name!='C' else x)
Out: 
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu

This basically iterates over columns. For each column, it generates value counts and uses that Series for mapping. This allows x.mask to check the condition whether the count is smaller than 3 or not. If that is the case, it returns 'other' and if not it uses the actual value. Lastly, a condition checks the column name.

lambda's condition can be generalized for multiple columns by changing it to x.name not in 'CDEF' or x.name not in ['C', 'D', 'E', 'F'] from x.name!='C'.

score 4 · Answer 2 · answered Jan 10 '17 at 20:34

using a helper function and replace

def replace_low_freq(df, threshold=2, replacement='other'):
    s = df.stack()
    c = s.value_counts()
    m = pd.Series(replacement, c.index[c <= threshold])
    return s.replace(m).unstack()

cols = list('AB')
replace_low_freq(df[cols]).join(df.drop(cols, 1))

       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3   None    cat  emu
4    ant  other  emu

jezrael · Answer 3 · 2017-01-10T20:22:14.937

You can use:

#added one last row for complicated df
df = pd.DataFrame({'A':['ant','ant','cherry', pd.np.nan, 'ant', 'd'], 
                   'B':['cat','peach', 'cat', 'cat', 'peach', 'm'], 
                   'C':['dog','dog',pd.np.nan, 'emu', 'emu', 'k']})
print (df)
        A      B    C
0     ant    cat  dog
1     ant  peach  dog
2  cherry    cat  NaN
3     NaN    cat  emu
4     ant  peach  emu
5       d      m    k

Use value_counts with boolean indexing for find all values for replace:

a = df.A.value_counts()
a = a[a < 3].index
print (a)
Index(['cherry', 'd'], dtype='object')

b = df.B.value_counts()
b = b[b < 3].index
print (b)
Index(['peach', 'm'], dtype='object')

And then replace with dict comprehension if more values for replacing:

df.A = df.A.replace({x:'other' for x in a})
df.B = df.B.replace({x:'other' for x in b})
print (df)
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu
5  other  other    k

All together in loop:

cols = ['A','B']
for col in cols:
    val = df[col].value_counts()
    y = val[val < 3].index
    df[col] = df[col].replace({x:'other' for x in y})
print (df)
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu
5  other  other    k

Hmm, so this works on this sample df, but when I try to do this with my actual data, I get an error with the replace w/dict comprehension line: ValueError: not enough values to unpack (expected 2, got 0). Any idea what might be going on there? — Imu, Jan 10 '17 at 20:41
I am not sure, maybe is necessary convert to list - `df[col] = df[col].replace({x:'other' for x in y.tolist()})` — jezrael, Jan 10 '17 at 20:51

Replace low frequency categorical values from pandas.dataframe while ignoring NaNs

3 Answers3