
I have the following pandas DataFrame:

import pandas as pd 
from pandas import Series, DataFrame

data = DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
              'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
              'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

I'd like to change the values in columns Qu1, Qu2 and Qu3 according to value_counts(), keeping a value only when its count is greater than or equal to some number.

For example, for the Qu1 column:

>>> pd.value_counts(data.Qu1) >= 2
cheese     True
potato     True
banana     True
apple     False
egg       False

I'd like to keep the values cheese, potato and banana, because each appears at least twice.

The values apple and egg should be replaced with the value other.

For column Qu2, no changes are needed:

>>> pd.value_counts(data.Qu2) >= 2
banana     True
apple      True
sausage    True

The final result should look like the attached test_data:

test_data = DataFrame({'Qu1': ['other', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'other'],
                  'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
                  'Qu3': ['other', 'potato', 'other', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'other']})

Thanks!

Toren

2 Answers


I would create a DataFrame of the same shape, where each entry is the value count of the corresponding entry in the original:

data.apply(lambda x: x.map(x.value_counts()))
Out[229]: 
   Qu1  Qu2  Qu3
0    1    2    1
1    2    4    3
2    3    3    1
3    2    3    3
4    3    3    3
5    2    2    3
6    3    4    3
7    2    4    3
8    1    4    1

Then use that result as the condition in df.where to return "other" wherever the corresponding count is smaller than 2:

data.where(data.apply(lambda x: x.map(x.value_counts()))>=2, "other")

      Qu1      Qu2     Qu3
0   other  sausage   other
1  potato   banana  potato
2  cheese    apple   other
3  banana    apple  cheese
4  cheese    apple  cheese
5  banana  sausage  potato
6  cheese   banana  cheese
7  potato   banana  potato
8   other   banana   other
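Putting the pieces together, a quick sanity check (recreating the question's `data` and `test_data`) confirms the one-liner reproduces the expected result:

```python
import pandas as pd
from pandas import DataFrame

data = DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
                  'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
                  'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

# Keep a value where it appears at least twice in its column; replace the rest with "other"
result = data.where(data.apply(lambda x: x.map(x.value_counts())) >= 2, "other")

test_data = DataFrame({'Qu1': ['other', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'other'],
                       'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
                       'Qu3': ['other', 'potato', 'other', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'other']})

assert result.equals(test_data)
```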
ayhan
  • Quite a bit more elegant & faster than my approach with `.replace()`! – Stefan May 15 '16 at 18:39
  • @StefanJansen Thanks. :) In my experience, `.replace()` is generally slower than `.map()` so I tend to use map when both are possible. Though I still think the apply-map-value_counts combination may be repeating things but I couldn't find a better alternative. – ayhan May 15 '16 at 18:56
  • Thanks ! Elegant solution. How the `.where() with >=2` works ? – Toren May 16 '16 at 06:15
  • `df.where()` takes a condition. When this condition holds, returns values from the original dataframe; otherwise it returns from another dataframe or a constant. For example `df.where(df>5, df2)` would return a dataframe where the values are from df if they are greater than 5, or from df2 if the values are smaller than or equal to 5. It is equivalent to `df[df<=5] = df2`. – ayhan May 16 '16 at 07:22
  • I added `inplace ="True"` and got `TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value` – Toren May 16 '16 at 07:59
  • You can try assigning directly: `data = data.where(data.apply(lambda x: x.map(x.value_counts()))>=2, "other")` – ayhan May 16 '16 at 09:49
  • I'd like to execute `where` on several columns only , and exclude others . What a best way to do it ? – Toren May 19 '16 at 06:28
  • It looks like your issue has been resolved with @EdChum's answer http://stackoverflow.com/a/37318069/2285236 – ayhan May 19 '16 at 17:25
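A minimal sketch of the `df.where` behaviour described in the comments above, with toy frames `df` and `df2` made up for the demo:

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 7], 'b': [9, 1]})
df2 = pd.DataFrame({'a': [0, 0], 'b': [0, 0]})

# Where the condition df > 5 holds, keep the value from df;
# everywhere else, take the value from df2 (or a scalar, e.g. "other")
out = df.where(df > 5, df2)
#    a  b
# 0  0  9
# 1  7  0
```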

You could first compute the value counts for each column:

value_counts = df.apply(lambda x: x.value_counts())

         Qu1  Qu2  Qu3
apple    1.0  3.0  1.0
banana   2.0  4.0  NaN
cheese   3.0  NaN  3.0
egg      1.0  NaN  1.0
potato   2.0  NaN  3.0
sausage  NaN  2.0  1.0

Then build a dictionary that will contain the replacements for each column:

from itertools import cycle
replacements = {}
for col, s in value_counts.items():
    if s[s<2].any():
        replacements[col] = dict(zip(s[s < 2].index.tolist(), cycle(['other'])))

replacements
{'Qu1': {'egg': 'other', 'apple': 'other'}, 'Qu3': {'egg': 'other', 'apple': 'other', 'sausage': 'other'}}
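As an aside, the same `replacements` dict can be built with a plain dict comprehension instead of `zip`/`cycle`; a standalone sketch (recreating `df` and `value_counts` from above so it runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
                   'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
                   'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

value_counts = df.apply(lambda x: x.value_counts())

# Map each rare value (count < 2) to "other"; NaN counts compare False, so
# values absent from a column are skipped automatically
replacements = {col: {val: 'other' for val in s[s < 2].index}
                for col, s in value_counts.items()
                if (s < 2).any()}
```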

Use the dictionary to replace the values:

df.replace(replacements)

      Qu1      Qu2     Qu3
0   other  sausage   other
1  potato   banana  potato
2  cheese    apple   other
3  banana    apple  cheese
4  cheese    apple  cheese
5  banana  sausage  potato
6  cheese   banana  cheese
7  potato   banana  potato
8   other   banana   other

or wrap the loop in a dictionary comprehension:

from itertools import cycle

df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})

However, this is not only more cumbersome but also slower than using .where. Testing with 3,000 columns:

df = pd.concat([df for i in range(1000)], axis=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 3000 entries, Qu1 to Qu3
dtypes: object(3000)

Using .replace():

%%timeit
value_counts = df.apply(lambda x: x.value_counts())
df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})

1 loop, best of 3: 4.97 s per loop

vs .where():

%%timeit
df.where(df.apply(lambda x: x.map(x.value_counts()))>=2, "other")

1 loop, best of 3: 2.01 s per loop
Stefan