
Given a dataframe:

import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})

I'd like to replace every value in column 'a' with the majority value in a window around it. For numerical data, I can do this:

import scipy.stats

def majority(window):
    # itemfreq returns one (value, count) row per distinct value
    freqs = scipy.stats.itemfreq(window)
    max_votes = freqs[:,1].argmax()
    return freqs[max_votes,0]

df['a'] = pd.rolling_apply(df['a'], 3, majority)

And I get:

In [43]: df
Out[43]: 
     a
0  NaN
1  NaN
2    1
3    1
4    1
5    1
6    1
7    2
8    2
9    2
10   2

I'll have to deal with the NaNs, but apart from that, this is more or less what I want. However, I'd also like to do the same thing with non-numerical columns, and Pandas does not seem to support this:

In [47]: df['b'] = list('aaaababbbba')
In [49]: df['b'] = pd.rolling_apply(df['b'], 3, majority)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-49-507f45aab92c> in <module>()
----> 1 df['b'] = pd.rolling_apply(df['b'], 3, majority)

/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in rolling_apply(arg, window, func, min_periods, freq, center, args, kwargs)
    751         return algos.roll_generic(arg, window, minp, offset, func, args, kwargs)
    752     return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 753                            center=False, args=args, kwargs=kwargs)
    754 
    755 

/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwargs, **kwds)
    382     arg = _conv_timerule(arg, freq, how)
    383 
--> 384     return_hook, values = _process_data_structure(arg)
    385 
    386     if values.size == 0:

/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _process_data_structure(arg, kill_inf)
    433 
    434     if not issubclass(values.dtype.type, float):
--> 435         values = values.astype(float)
    436 
    437     if kill_inf:

ValueError: could not convert string to float: a

I've tried converting the column to a Categorical, but even then I get the same error. I could convert to a Categorical, work on the codes, and finally convert the codes back to labels, but that seems really convoluted.

Is there an easier/more natural solution?

(BTW: I'm limited to NumPy 1.8.2 so I have to use itemfreq instead of unique, see here.)
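
For readers who are not tied to NumPy 1.8.2, the `unique`-based variant mentioned above can be sketched as follows. This is a minimal sketch, assuming a NumPy version whose `np.unique` accepts `return_counts=True`; it is not the code from the question, and ties are resolved in favour of the smallest value in sorted order.

```python
import numpy as np

def majority(window):
    # np.unique returns the sorted distinct values and, with
    # return_counts=True, how often each of them occurs.
    values, counts = np.unique(window, return_counts=True)
    # argmax picks the first maximum, so ties go to the smallest value.
    return values[counts.argmax()]

print(majority([1, 2, 2]))        # numeric window
print(majority(list('aba')))      # string window
```

Unlike the rolling machinery in the traceback above, this function itself has no problem with strings; the conversion to float happens inside `rolling_apply`, not in the majority vote.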

Johannes Bauer

2 Answers


Here is a way, using pd.Categorical:

import scipy.stats as stats
import pandas as pd

def majority(window):
    freqs = stats.itemfreq(window)
    max_votes = freqs[:,1].argmax()
    return freqs[max_votes,0]

df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['a'] = pd.rolling_apply(df['a'], 3, majority)
df['b'] = list('aaaababbbba')

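# Encode the labels as integer codes so the rolling function can work on numbers
cat = pd.Categorical(df['b'])
df['b'] = pd.rolling_apply(cat.codes, 3, majority)
# Map the winning codes back to their labels (NaN stays NaN)
df['b'] = df['b'].map(pd.Series(cat.categories))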
print(df)

yields

     a    b
0  NaN  NaN
1  NaN  NaN
2    1    a
3    1    a
4    1    a
5    1    a
6    1    b
7    2    b
8    2    b
9    2    b
10   2    b
unutbu
  • That's a nice workaround using categories' `codes` and `categories` to convert to numeric and back. +1 – pansen Jun 19 '17 at 09:14
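
On newer pandas releases, where the top-level `pd.rolling_apply` is no longer available, the same codes round trip can be sketched with `Series.rolling(...).apply(..., raw=True)`. This is a sketch under stated assumptions, not the answer's own code: `majority_code` and the column name `b_majority` are hypothetical, the pandas version is assumed to support the `raw` argument, and the column is assumed to contain no missing values. Ties go to the lowest code.

```python
import numpy as np
import pandas as pd

def majority_code(window):
    # With raw=True the window arrives as a float ndarray of category codes.
    # bincount needs non-negative ints, hence the no-missing-values assumption.
    return np.bincount(window.astype(int)).argmax()

df = pd.DataFrame({'b': list('aaaababbbba')})
cat = pd.Categorical(df['b'])

# Roll over the integer codes; the first window-1 rows come back as NaN.
rolled = pd.Series(cat.codes, index=df.index).rolling(3).apply(majority_code, raw=True)

# Code -1 marks a missing value in a Categorical, so the NaN rows stay missing.
codes = rolled.fillna(-1).astype(int)
df['b_majority'] = pd.Categorical.from_codes(codes, categories=cat.categories)
print(df)
```

Converting back with `Categorical.from_codes` keeps the result as a categorical column; mapping through `cat.categories`, as in the answer, would give plain labels instead.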

Here is one way to do it by defining your own rolling apply function.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['b'] = np.where(df.a == 1, 'A', 'B')

print(df)

Out[60]: 
    a  b
0   1  A
1   1  A
2   1  A
3   1  A
4   1  A
5   2  B
6   1  A
7   2  B
8   2  B
9   2  B
10  2  B

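def get_mode_from_Series(series):
    # value_counts sorts by frequency, so the first index entry is the mode
    return series.value_counts().index[0]

def my_rolling_apply_char(frame, window, func):
    # Label each result with the last row of its window
    index = frame.index[window-1:]
    # Apply func to every full window of consecutive rows
    values = [func(frame.iloc[i:i+window]) for i in range(len(frame)-window+1)]
    # Realign to the original index; the first window-1 rows become NaN
    return pd.Series(data=values, index=index).reindex(frame.index)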

my_rolling_apply_char(df.b, 3, get_mode_from_Series)

Out[61]: 
0     NaN
1     NaN
2       A
3       A
4       A
5       A
6       A
7       B
8       B
9       B
10      B
dtype: object
Jianxun Li
  • I suppose going through Categorical is the least bulky way to do it, after all, but I'm accepting this since I explicitly asked for something different. BTW: I'm hazy on the whole indexing business in Pandas: could you explain what the `index=index` and `reindex()` bits do? – Johannes Bauer Jul 06 '15 at 06:54
  • @JohannesBauer `index=index` forces the returned `pd.Series` to have index `2,3,...,10` rather than the default integer index `0,1,...,8`. The final `reindex` aligns the result with the index of the original `df` and fills the labels that have no result with `NaN`. – Jianxun Li Jul 06 '15 at 07:23
  • Thanks a lot, @Jianxun. – Johannes Bauer Jul 06 '15 at 07:28
  • This is a godsend. Still no idea what my_rolling_apply_char does, but it makes rolling_apply work on strings and that's all I care about! – swyx Jul 05 '16 at 23:52
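
The `index=index` and `reindex` steps discussed in the comments above can be seen in isolation in the small sketch below. The five-row index and the three values are made up for illustration; only the `pd.Series(..., index=...)` and `reindex` calls mirror the answer.

```python
import pandas as pd

original_index = pd.RangeIndex(5)   # stands in for frame.index
window = 3
values = ['A', 'A', 'B']            # one result per full window

# Without index=..., the results would be labelled 0, 1, 2.
# Labelling them with the last row of each window gives 2, 3, 4.
s = pd.Series(values, index=original_index[window - 1:])

# reindex aligns to the original labels; rows 0 and 1 have no result -> NaN.
aligned = s.reindex(original_index)
print(aligned)
```

Labelling each result with the last row of its window is what makes the output line up with the `pd.rolling_apply` result shown in the question.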