
I have a moderately large (~60,000 rows by 15 columns) CSV file that I'm working on with pandas. Each row represents an individual and contains personal data. I want to anonymize the data. One way I want to do so is by replacing values in a particular column wherever they are rare. I initially tried to do so as follows:

def clean_data(entry):
    # builds a boolean mask over the whole frame for every single row
    if df[df.column_name == entry].index.size < 10:
        return 'RARE_VALUE'
    else:
        return entry

df['new_column_name'] = df.column_name.apply(clean_data)  # bracket assignment so a real column is created

But running it froze my system every time. This unfortunately means I have no useful debugging data. Does anyone know the correct way to do this? The column contains both strings and null values.

Garry Cairns
  • You can [set up a progress meter for apply](http://stackoverflow.com/a/18611535/1240268), but this obviously slows down whatever it is you're doing. Generally a bad idea to return different types of data in an apply (here a string or a Series), it's unclear what you want the apply to return... – Andy Hayden Mar 05 '14 at 20:43
  • @AndyHayden Your comment suggests to me that I may not understand `apply` properly. My understanding was that my function would return the string 'RARE_VALUE' if the condition were met but keep the existing string/null if it weren't. Is this incorrect? – Garry Cairns Mar 05 '14 at 20:46
  • Ah wait, I see what you're saying, I mistook this for a DataFrame apply. No you're correct, but boolean masking at each step is **slow**!! – Andy Hayden Mar 05 '14 at 20:48

2 Answers


I think you want to groupby column name:

g = df.groupby('column_name')

You can use a filter, for example, to return only those rows whose value in column_name appears at least 10 times:

g.filter(lambda x: len(x) >= 10)
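
For instance, on a toy frame (the data here is invented purely to illustrate the behaviour; the column name is taken from the question):

import pandas as pd

toy = pd.DataFrame({'column_name': ['common'] * 12 + ['rare'] * 3})

# 'rare' occurs only 3 times, so its rows are dropped; the 12 'common' rows survive
toy.groupby('column_name').filter(lambda x: len(x) >= 10)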

To overwrite the column with 'RARE_VALUE' you can use transform (which calculates the result once for each group and broadcasts it back to the group's rows); here `col` stands for your column name, e.g. `col = 'column_name'`:

df.loc[g[col].transform(lambda x: len(x) < 10).astype(bool), col] = 'RARE_VALUE'

As DSM points out, the following trick is much faster:

df.loc[df[col].value_counts()[df[col]].values < 10, col] = "RARE_VALUE"

Here's some timeit information (to show how impressive DSM's solution is!):

In [20]: df = pd.DataFrame(np.random.randint(1, 100, (1000, 2)))

In [21]: g = df.groupby(0)

In [22]: %timeit g.filter(lambda x: len(x) >= 10)
10 loops, best of 3: 67.2 ms per loop

In [23]: %timeit df.loc[g[1].transform(lambda x: len(x) < 10).values.astype(bool), 1]
10 loops, best of 3: 44.6 ms per loop

In [24]: %timeit df.loc[df[1].value_counts()[df[1]].values < 10, 1]
1000 loops, best of 3: 1.57 ms per loop
Andy Hayden
  • `df.loc[df[col].value_counts()[df[col]].values < 10, col] = "RARE_VALUE"` might be a little faster. – DSM Mar 05 '14 at 20:53
  • @DSM that is some proper voodoo magic... *surely* that's not faster! – Andy Hayden Mar 05 '14 at 20:54
  • Care to put some quatloos on that? – DSM Mar 05 '14 at 20:55
  • @AndyHayden that gives me a list of column names then `not in index`. @DSM your variant gives me a list of values from the column then `not in index` (after I'd done a fillna because originally wouldn't index with NA present)! – Garry Cairns Mar 05 '14 at 20:59
  • @DSM Wow, that is waaay faster. :o – Andy Hayden Mar 05 '14 at 20:59
  • Got it! I'd missed the quotes from the column name just before the closing bracket. Thanks for this. – Garry Cairns Mar 05 '14 at 21:14
  • @DSM are these *slightly* different? Will look again at this later, posted as issue https://github.com/pydata/pandas/issues/6556 – Andy Hayden Mar 05 '14 at 21:18
  • @AndyHayden: is there a reason that you do `.groupby(0)` and then `g[1]`? – DSM Mar 05 '14 at 21:28
  • Yeah... there is something strange going on, I think maybe there's a bug here; the transform returned a DataFrame otherwise – Andy Hayden Mar 05 '14 at 21:29
  • I'm not sure I totally understand what you're doing. Wouldn't the groupby equivalent be something like `df.groupby(0)[0].transform("count") < 10`? – DSM Mar 05 '14 at 21:46
  • @DSM ah, I see my mistake (confirming they should be the same if I wrote your answer correctly). I'm v. shocked yours is so much faster than transform and the transform returns 2d data in this case... (also confused by transform('count')!) – Andy Hayden Mar 05 '14 at 22:02
  • transform has to go back/forth to python space; @DSM soln is all indexing (and value_counts which is in cython) – Jeff Mar 05 '14 at 22:13
  • I am looking at @DSM's solution and I am wondering how it works. It seems like he is slicing a series with another series, but it somehow replaces the second series' index with the original series' values... – Lim Kaizhuo Jan 08 '21 at 05:49

@Andy Hayden solves the issue in various ways. I would recommend using pipelines for this kind of task, though. The following may seem more unwieldy, but it comes in handy if you want to save the whole pipeline as an object, or if you have to apply the same replacements to a test set:

class RemoveScarceValuesFeatureEngineer:

    def __init__(self, min_occurences):
        self._min_occurences = min_occurences
        self._column_value_counts = {}

    def fit(self, X, y):
        # learn each column's value counts from the training data
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        # replace values seen fewer than min_occurences times at fit time
        for column in X.columns:
            X.loc[self._column_value_counts[column][X[column]].values
                  < self._min_occurences, column] = "RARE_VALUE"

        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
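
A quick usage sketch (the sample frames and the threshold are invented just for illustration):

import pandas as pd

train = pd.DataFrame({'city': ['London'] * 5 + ['Oban']})
test = pd.DataFrame({'city': ['London', 'Oban']})

fe = RemoveScarceValuesFeatureEngineer(min_occurences=2)
train = fe.fit_transform(train, y=None)  # 'Oban' occurs once in train, so it becomes 'RARE_VALUE'
test = fe.transform(test)                # reuses the counts learned on train

Note that transform indexes the stored value_counts with the incoming column, so a value unseen at fit time would raise a KeyError in recent pandas; you may want to guard against that if your test data can contain new categories.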

You may find more information here: Pandas replace rare values

RUser4512