Given a dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2]})
I'd like to replace every value in column 'a' with the majority value among its neighbours. For numerical data, I can do this:
import scipy.stats

def majority(window):
    # itemfreq returns one row per distinct value: (value, count)
    freqs = scipy.stats.itemfreq(window)
    max_votes = freqs[:, 1].argmax()
    return freqs[max_votes, 0]
df['a'] = pd.rolling_apply(df['a'], 3, majority)
And I get:
In [43]: df
Out[43]:
      a
0   NaN
1   NaN
2     1
3     1
4     1
5     1
6     1
7     2
8     2
9     2
10    2
I'll have to deal with the NaNs, but apart from that, this is more or less what I want. However, I'd like to do the same thing with non-numerical columns, and Pandas does not seem to support this:
In [47]: df['b'] = list('aaaababbbba')
In [49]: df['b'] = pd.rolling_apply(df['b'], 3, majority)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-507f45aab92c> in <module>()
----> 1 df['b'] = pd.rolling_apply(df['b'], 3, majority)
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in rolling_apply(arg, window, func, min_periods, freq, center, args, kwargs)
751 return algos.roll_generic(arg, window, minp, offset, func, args, kwargs)
752 return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 753 center=False, args=args, kwargs=kwargs)
754
755
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwargs, **kwds)
382 arg = _conv_timerule(arg, freq, how)
383
--> 384 return_hook, values = _process_data_structure(arg)
385
386 if values.size == 0:
/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _process_data_structure(arg, kill_inf)
433
434 if not issubclass(values.dtype.type, float):
--> 435 values = values.astype(float)
436
437 if kill_inf:
ValueError: could not convert string to float: a
I've tried converting 'a' to a Categorical, but even then I get the same error. I can first convert to a Categorical, work on the codes, and finally convert back from codes to labels, but that seems really convoluted.
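For reference, here is a minimal sketch of the round trip I mean. It rests on several assumptions that are not part of my actual setup: it uses pd.factorize to get the integer codes, the newer Series.rolling(...).apply(...) spelling rather than pd.rolling_apply, and np.bincount in place of itemfreq (which only works because the codes are small non-negative integers). The column name b_majority is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': list('aaaababbbba')})

# Encode the labels as integer codes; factorize returns (codes, uniques).
codes, labels = pd.factorize(df['b'])

def majority_code(window):
    # The rolling window arrives as floats, so cast back to int before counting.
    # On a tie, argmax picks the smallest code.
    return float(np.bincount(window.astype(int)).argmax())

# Rolling majority vote over the codes (assumes a pandas with Series.rolling).
voted = pd.Series(codes, index=df.index).rolling(3).apply(majority_code, raw=True)

# Map codes back to labels; rows without a full window stay NaN.
df['b_majority'] = voted.map(lambda c: labels[int(c)] if pd.notna(c) else np.nan)
```

This gives the rolling majority for the string column, with the same leading NaNs as in the numerical case, but it takes three steps for what is a single call on numbers.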
Is there an easier/more natural solution?
(BTW: I'm limited to NumPy 1.8.2, so I have to use itemfreq instead of unique; see here.)