
How can I remove values that occur rarely, i.e. with a low frequency, from a column of a pandas.DataFrame? Example:

In [4]: df[col_1].value_counts()

Out[4]: 0       189096
        1       110500
        2        77218
        3        61372
              ...
        2065         1
        2067         1
        1569         1
        dtype: int64

So, my question is: how do I remove values like 2065, 2067, 1569 and others? And how can I do this for ALL columns whose .value_counts() output looks like this?

UPDATE: By 'low' I mean values like 2065. This value occurs in col_1 only 1 (one) time, and I want to remove values like this.
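Here is a small synthetic frame (not my real data, just for illustration) that produces the same kind of skewed counts, with a few dominant values and many values that occur only once:

import numpy as np
import pandas as pd

np.random.seed(0)
common = np.random.choice([0, 1, 2, 3], size=1000)  # a few frequent values
rare = np.arange(1000, 1050)                        # 50 values, each occurring exactly once
df = pd.DataFrame({'col_1': np.concatenate([common, rare])})
df['col_1'].value_counts()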

Gilaztdinov Rustam
  • What do you mean by "meet rarely"? Can you be more specific? Adding some code that you already tried might help. – the Sep 10 '15 at 20:58
  • I can't understand how can be "more specific"? In ```col_1``` value like ```0``` meet ```189096``` times. And value like ```1569``` meet ONE times. I want to remove ```1569``` and other values like this in ```col_1```. – Gilaztdinov Rustam Sep 10 '15 at 21:02
  • Aha! You mean "occurs", not "meets". – the Sep 10 '15 at 21:03
  • Now if you provide a self-containing piece of code that generates a similar data structure it will be much easier to come up with an answer. – the Sep 10 '15 at 21:04
  • OK. My English is bad and I know it. But I work about it =). So, what about my question? – Gilaztdinov Rustam Sep 10 '15 at 21:05
  • Ideally the question would provide a self-contained piece of code generating the data structure, or even just something like `df = [[1, 2], [2, 3], [4, 5]]`, enough to try to get an answer without diving into panda. – the Sep 10 '15 at 21:09

3 Answers


I see there are two ways you might want to do this.

For the entire DataFrame

This approach replaces values that occur infrequently anywhere in the DataFrame with NaN. We can do it without loops, using built-in functions to speed things up.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])

threshold = 10 # Anything occurring this many times or fewer will be replaced with NaN.
value_counts = df.stack().value_counts() # Entire DataFrame 
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
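If you want to actually drop the affected rows afterwards rather than keep the NaNs, you could follow this with dropna, for example:

df = df.dropna()  # drop every row that contained at least one rare value
# df.dropna(how='all') would drop only rows where every value was rare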

Column-by-column

This approach replaces entries that occur infrequently within each individual column with NaN.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])

threshold = 10 # Anything occurring this many times or fewer will be replaced with NaN.
for col in df.columns:
    value_counts = df[col].value_counts() # Specific column 
    to_remove = value_counts[value_counts <= threshold].index
    df[col] = df[col].replace(to_remove, np.nan)  # avoids inplace replace on a column selection
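As an optional sanity check, you can recount per column afterwards; nothing with a count at or below the threshold should remain:

counts_after = df.apply(lambda col: col.value_counts())
print((counts_after <= threshold).any().any())  # expect False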
thecircus
  • Hi, I'm missing a piece of logic, can someone please explain why you need the ".index" (or ".values" below) for the code to work, TIA – eli Dec 24 '19 at 21:18
  • @eli - it's because value_counts is a Series and the .index is the value that you want to remove. – morganics Jan 16 '20 at 15:12
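For illustration, value_counts returns a Series indexed by the values themselves, with the counts as the data, which is why .index gives the values to remove:

import pandas as pd

vc = pd.Series(['a', 'b', 'b']).value_counts()  # b -> 2, a -> 1
vc[vc <= 1].index                               # Index(['a'], dtype='object')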

You probably don't want to remove an entire row of your DataFrame just because one column has a value below your threshold, so I've simply replaced those data points with None instead.

I loop through each column and run value_counts on it. I then take the index values of every item whose count is at or below the target threshold. Finally, I use .loc to locate those values in the column and replace them with None.

df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'], 
                   'B': ['a', 'a', 'b', 'c', 'c'], 
                   'C': ['a', 'a', 'b', 'b', 'c']})

>>> df
   A  B  C
0  a  a  a
1  b  a  a
2  b  b  b
3  c  c  b
4  c  c  c

threshold = 1  # Remove values that occur this many times or fewer
for col in df:
    vc = df[col].value_counts()
    vals_to_remove = vc[vc <= threshold].index.values
    df.loc[df[col].isin(vals_to_remove), col] = None  # avoid chained assignment

>>> df
      A     B     C
0  None     a     a
1     b     a     a
2     b  None     b
3     c     c     b
4     c     c  None
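As a quick check on the same example, counting the missing cells per column shows exactly one value was nulled in each:

>>> df.isnull().sum()
A    1
B    1
C    1
dtype: int64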
Alexander

With apply this can be accomplished in one line. value_counts returns the counts of the unique values in a column. map then maps each element of the column to its own count. where replaces all values where the condition is not met (here: the count must be greater than 1) with NaN.

In [6]: df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'], 'B': ['a', 'a', 'b', 'c', 'c'],  'C': ['a', 'a', 'b', 'b', 'c']})

In [7]: df
Out[7]: 
   A  B  C
0  a  a  a
1  b  a  a
2  b  b  b
3  c  c  b
4  c  c  c

In [8]: df.apply(lambda col: col.where(col.map(col.value_counts()) > 1))
Out[8]: 
     A    B    C
0  NaN    a    a
1    b    a    a
2    b  NaN    b
3    c    c    b
4    c    c  NaN
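To make the intermediate step concrete, this is what the map lookup produces for column A, i.e. each element replaced by its count, which where then compares against the threshold:

In [9]: df['A'].map(df['A'].value_counts())
Out[9]: 
0    1
1    2
2    2
3    2
4    2
Name: A, dtype: int64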

Credits to @Alexander for the dataset.

rachwa