I have a one-column DataFrame with 5 million rows. I want to reduce it to 25k rows by aggregating each block of 200 consecutive rows into one (25k x 200 = 5,000,000). Each output row should take the class label that is most frequent among its 200 rows.

Example :

import pandas as pd

df = pd.DataFrame({'a': ['s','s','t','s','s','t','s','t','t','w','w','t','w','s','w']})
print(df)

Out[60]: 
     a
0   s
1   s
2   t
3   s
4   s
5   t
6   s
7   t
8   t
9   w
10  w
11  t
12  w
13  s
14  w

I want to do something like this (an example) :

my_rolling_apply(my_column, window_size=3, function=majority_voted_class)

The expected output would be:

Out[2]: 
   a
0  s
1  s
2  t
3  w
4  w

How can I do this? Is there an existing function that handles this task?

Update :

The only issue is that I need to control the size of the groups: the grouping should produce equal-sized groups of consecutive rows, and each group should be assigned its most common label.
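To make the requirement concrete, here is a minimal sketch of one way fixed-size groups could be built: integer-divide each row's position by the block size and group on that key. The helper name `majority_per_block` and its `block_size` parameter are made up for illustration and are not part of pandas. Ties are broken by taking the first value returned by `Series.mode()`, which is an assumption, since the question does not say how ties should be handled.

```python
import numpy as np
import pandas as pd


def majority_per_block(series, block_size):
    """Return the most frequent value in each consecutive block of rows.

    Hypothetical helper, not a pandas function.
    """
    # Positional block ids: 0,0,0,1,1,1,... for block_size=3.
    # Positions are used instead of index labels so a non-default index still works.
    block_ids = np.arange(len(series)) // block_size
    # Series.mode() returns the most frequent value(s) in sorted order;
    # taking the first one is an assumed tie-breaking rule.
    return series.groupby(block_ids).agg(lambda block: block.mode().iat[0])


df = pd.DataFrame({'a': ['s', 's', 't', 's', 's', 't', 's', 't', 't',
                         'w', 'w', 't', 'w', 's', 'w']})
result = majority_per_block(df['a'], block_size=3).to_frame()
print(result)
```

For the real data the same call would use `block_size=200`. If the number of rows is not a multiple of the block size, the last group is simply shorter.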

smerllo
  • What have you tried so far? Where exactly is the problem? – Ralf Oct 16 '18 at 19:45
  • I could not find an appropriate function for this. – smerllo Oct 16 '18 at 19:46
  • I know about functions like pd.rolling_apply, but it does not seem to be what I am looking for. – smerllo Oct 16 '18 at 19:47
  • From [this question](https://stackoverflow.com/questions/15222754/group-by-pandas-dataframe-and-select-most-common-string-factor) you may be able to use something like `df.groupby('a').agg(lambda x:x.value_counts().index[0])` – G. Anderson Oct 16 '18 at 20:00
  • Correct. The only issue is that I need to control the size of the groups. The grouping should produce equal-sized groups so that the most common label can be assigned to each group. – smerllo Oct 16 '18 at 20:02

0 Answers