1

am trying to combine features of in a dataframe to derive a new columns in the dataframe

I have this dataframe

Id   Author   News_post  Label
1    Jessica  xxxxxxxxx  1
2    Adams    xxxxxxxxx  1
3    Adams    xxxxxxxxx  1
4    Mike     xxxxxxxxx  0
5    James    xxxxxxxxx  1
6    Mike     xxxxxxxxx  1
7    Mike     xxxxxxxxx  0
8    Paul     xxxxxxxxx  0
9    Jessica  xxxxxxxxx  0
10   Adams    xxxxxxxxx  0

NB: where the Label column have 1=TRUE AND 0=FALSE

Id   Author   Num_Post  Num_True_Label  Num_False_Label   Mean
1    Adams    3         2               1                 x
2    James    1         1               0                 x
3    Jessica  2         1               1                 x
4    Mike     2         0               1                 x
5    Paul     1         0               0                 x
  • Duplicates https://stackoverflow.com/questions/39922986/pandas-group-by-and-sum/39923815. – Alex Fish Jul 27 '19 at 21:37
  • https://stackoverflow.com/questions/48059985/pandas-create-boolean-column-using-groupby-transform look here – frankegoesdown Jul 27 '19 at 21:39
  • Possible duplicate of [pandas create boolean column using groupby transform](https://stackoverflow.com/questions/48059985/pandas-create-boolean-column-using-groupby-transform) – M_S_N Jul 27 '19 at 21:40
  • your data doesn't make much sense, the ID's in the first df and 2nd df don't match? – Umar.H Jul 27 '19 at 21:49
  • the id's for the output does not matter. since the out is to create a new dataframe – umagba alex Jul 27 '19 at 22:03

5 Answers5

1

This may solve a number of things you are trying to get from your issue:

df = pd.read_clipboard()  # just copied your dataframe
df = df.groupby('Author').describe()

Output:

           Id                                               Label                                               
        count      mean       std  min  25%  50%  75%   max count      mean       std  min   25%  50%   75%  max
Author                                                                                                          
Adams     3.0  5.000000  4.358899  2.0  2.5  3.0  6.5  10.0   3.0  0.666667  0.577350  0.0  0.50  1.0  1.00  1.0
James     1.0  5.000000       NaN  5.0  5.0  5.0  5.0   5.0   1.0  1.000000       NaN  1.0  1.00  1.0  1.00  1.0
Jessica   2.0  5.000000  5.656854  1.0  3.0  5.0  7.0   9.0   2.0  0.500000  0.707107  0.0  0.25  0.5  0.75  1.0
Mike      3.0  5.666667  1.527525  4.0  5.0  6.0  6.5   7.0   3.0  0.333333  0.577350  0.0  0.00  0.0  0.50  1.0
Paul      1.0  8.000000       NaN  8.0  8.0  8.0  8.0   8.0   1.0  0.000000       NaN  0.0  0.00  0.0  0.00  0.0
Sazzy
  • 1,924
  • 3
  • 19
  • 27
1

The following will get you what you need:

In [1]: import pandas as pd                                                                                                                                                                                                                  

In [2]: df = pd.DataFrame({'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James', 'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'], 'News_post': ['xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xx
    ...: xxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx'], 'Label': [1,1,1,0,1,1,0,0,0,0]})                                                                                                                                                                

In [3]: num_true_label_df = df.groupby(by=['Author']).sum().rename(columns={'Label': 'Num_True_Label'}).reset_index()                                                                                                                        

In [4]: num_post_df = df.groupby(by=['Author']).count().rename(columns={'News_post': 'Num_Post'})[['Num_Post']].reset_index()                                                                                                                

In [5]: df = pd.merge(num_post_df, num_true_label_df, how='left', on='Author').reset_index().rename(columns={'index': 'Id'})

In [6]: df['Id'] = df['Id'] + 1

In [7]: df['Num_False_Label'] = df['Num_Post'] - df['Num_True_Label']

In [8]: df                                                                                                                                                                                                                                
Out[7]: 
   Id   Author  Num_Post  Num_True_Label  Num_False_Label
0   1    Adams         3               2                1
1   2    James         1               1                0
2   3  Jessica         2               1                1
3   4     Mike         3               1                2
4   5     Paul         1               0                1


Please further specify what your Mean column should represent.

Some resources which might be helpful: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

Gabriela Melo
  • 588
  • 1
  • 4
  • 15
1

Using Pandas 0.25 with aggregation relabeling

df.groupby('Author')['Label'].agg(Num_Post = 'size',
                                  Num_True = 'sum',
                                  Num_False = lambda x: x.eq(0).sum(),
                                  Mean = 'mean')

Output:

         Num_Post  Num_True  Num_False      Mean
Author                                          
Adams           3         2          1  0.666667
James           1         1          0  1.000000
Jessica         2         1          1  0.500000
Mike            3         1          2  0.333333
Paul            1         0          1  0.000000
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
0

Use transform and then remove the duplicates such that:

df['Num_Post']= df.groupby(['Author'])['Label'].transform('count')
df['Num_True_Label']= df.groupby(['Author'])['Label'].transform('sum')
df['Num_False_Label']= df['Num_Post']-df['Num_True_Label']
df['Mean']= df['Num_Post']/df['Num_True_Label']

Finally: drop dups and remove the News_post

df.drop(columns=['News_post'], inplace=True)
df.drop_duplicates(subset='Author', keep='first').sort_values(by=['Author'])

result:

    Id  Author      Label   Num_Post    Num_True_Label  Num_False_Label Mean
    1   2   Adams       1       3           2               1               1.500000
    4   5   James       1       1           1               0               1.000000
    0   1   Jessica     1       2           1               1               2.000000
    3   4   Mike        0       3           1               2               3.000000
    7   8   Paul        0       1           0               1               inf

Note: change the mean for your definition.

adhg
  • 10,437
  • 12
  • 58
  • 94
  • thanks, it works but the value of the resultis not correct. and the output does not need the column for News_Post because the column will show no significant in the output – umagba alex Jul 27 '19 at 23:11
  • just drop the column. see my edit (you can also drop the label with: df.drop(columns=['News_post','Label'], inplace=True) – adhg Jul 27 '19 at 23:17
0

you could try :

agg_df = df.groupby('Author')['Label'].agg({"Num_post" : 'count', 'Num_True_Label' : 
                                             lambda x : x.eq(1).sum(), 
                                            'Num_False_Label':lambda x : 
                                            x.eq(0).sum(), 
                                            'Mean':'mean'}).reset_index()
Ayoub ZAROU
  • 2,387
  • 6
  • 20