     rev_id  worker_id  toxicity  toxicity_score
0    2232.0        723         0             0.0
1    2232.0       4000         0             0.0
2    2232.0       3989         0             1.0
3    2232.0       3341         0             0.0
4    2232.0       1574         0             1.0
5    2232.0       1508         0             1.0
6    2232.0        772         0             1.0
7    2232.0        680         0             0.0
8    2232.0        405         0             1.0
9    2232.0       4020         1            -1.0
10   4216.0        500         0             0.0
11   4216.0        599         0             0.0
12   4216.0        339         0             2.0
13   4216.0        257         0             0.0
14   4216.0        303         0             1.0
15   4216.0        188         0             0.0
16   4216.0       1549         0             1.0
17   4216.0         64         0             1.0
18   4216.0       1527         0             0.0
19   4216.0       1502         0             0.0
20   8953.0       2596         0             1.0
21   8953.0       2403         0             0.0
22   8953.0       2539         0             0.0
23   8953.0       2542         0             0.0
24   8953.0       2544         0             0.0
25   8953.0       1016         0             0.0
26   8953.0       2550         0             0.0
27   8953.0       2578         0             0.0
28   8953.0       2494         0             0.0
29   8953.0        971         0             0.0

I want to get the mode (either 1 or 0) of toxicity and the mean of toxicity_score, grouped by rev_id, using pandas. How can I do this? Thanks.

Espoir Murhabazi
yanachen

1 Answer


It seems you need groupby with agg, taking the mean of toxicity_score and the mode of toxicity:

# mode() can return several values on a tie; .iat[0] keeps the first one
df = (df.groupby('rev_id', as_index=False)
        .agg({'toxicity_score': 'mean', 'toxicity': lambda x: x.mode().iat[0]}))

An alternative is value_counts, selecting the first value of the index (the most frequent value):

df = (df.groupby('rev_id', as_index=False)
        .agg({'toxicity_score':'mean', 'toxicity': lambda x: x.value_counts().index[0]}))

print(df)
   rev_id  toxicity_score  toxicity
0  2232.0             0.4         0
1  4216.0             0.5         0
2  8953.0             0.1         0
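To make the approach above easy to try, here is a minimal, self-contained sketch that rebuilds the question's sample data and runs the aggregation. The `worker_id` column is left out because the aggregation does not use it, and the name `result` is chosen here for illustration. Taking `.iat[0]` of the mode is one possible way to get a single value per group, not the only one.

```python
import pandas as pd

# Sample data copied from the question (worker_id omitted: it is not aggregated).
df = pd.DataFrame({
    "rev_id": [2232.0] * 10 + [4216.0] * 10 + [8953.0] * 10,
    "toxicity": [0] * 9 + [1] + [0] * 20,
    "toxicity_score": [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, -1.0,
                       0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0,
                       1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
})

# as_index=False keeps rev_id as a regular column in the result.
# mode() always returns a Series, so .iat[0] reduces it to one scalar per group.
result = (
    df.groupby("rev_id", as_index=False)
      .agg({"toxicity_score": "mean",
            "toxicity": lambda x: x.mode().iat[0]})
)

print(result)
```

The result has three columns (`rev_id`, `toxicity_score`, `toxicity`) and one row per `rev_id`.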
jezrael
  • After the operation, rev_id is no longer a column. How can I convert the result to three columns? – yanachen Jan 12 '18 at 11:48
  • Please check last edit. – jezrael Jan 12 '18 at 11:50
  • mode() returns two numbers, like [0,1]. I just want the most common number, grouped by rev_id. – yanachen Jan 12 '18 at 12:02
  • It seems you need `x.mode()[0]`, or upgrade pandas to the latest version; in `0.22.0` it works nicely. – jezrael Jan 12 '18 at 12:06
  • Thanks. If there are five 1s and five 0s in one rev_id, what will the order be in x.mode()? – yanachen Jan 12 '18 at 12:08
  • @yanachen - sorry, I have no idea. But judging by [this](https://github.com/pandas-dev/pandas/blob/461221dd58d7fa7fcc247a7eec0e409309e82394/pandas/core/algorithms.py#L638), it seems to be the first of the sorted values. I'm not 100% sure, though. – jezrael Jan 12 '18 at 12:14
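The tie case from the last two comments can be checked directly. The sketch below builds a Series with five 1s and five 0s (the names `s`, `modes`, `smallest` and `largest` are made up for illustration). As far as I can tell, `Series.mode()` returns the modes in sorted order, which is consistent with the source linked above, so on a tie the first element is the smallest value.

```python
import pandas as pd

# Five 1s and five 0s: both values are modes.
s = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

modes = s.mode()
print(modes.tolist())  # [0, 1]: both modes, in ascending order

# Pick one value explicitly instead of relying on how a multi-valued
# result is handled inside agg().
smallest = modes.iat[0]   # tie resolved toward 0
largest = modes.iat[-1]   # tie resolved toward 1
print(smallest, largest)
```

If ties should resolve toward 1 rather than 0, `x.mode().iat[-1]` (or `x.mode().max()`) can be used in the `agg` lambda instead.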