dataframe in pandas with certain conditions

Question

I am trying to combine features in a dataframe to derive new columns.

I have this dataframe:

Id   Author   News_post  Label
1    Jessica  xxxxxxxxx  1
2    Adams    xxxxxxxxx  1
3    Adams    xxxxxxxxx  1
4    Mike     xxxxxxxxx  0
5    James    xxxxxxxxx  1
6    Mike     xxxxxxxxx  1
7    Mike     xxxxxxxxx  0
8    Paul     xxxxxxxxx  0
9    Jessica  xxxxxxxxx  0
10   Adams    xxxxxxxxx  0

NB: in the Label column, 1 = TRUE and 0 = FALSE.

I want to derive this output:

Id   Author   Num_Post  Num_True_Label  Num_False_Label   Mean
1    Adams    3         2               1                 x
2    James    1         1               0                 x
3    Jessica  2         1               1                 x
4    Mike     2         0               1                 x
5    Paul     1         0               0                 x

Duplicates https://stackoverflow.com/questions/39922986/pandas-group-by-and-sum/39923815. — Alex Fish, Jul 27 '19 at 21:37
https://stackoverflow.com/questions/48059985/pandas-create-boolean-column-using-groupby-transform look here — frankegoesdown, Jul 27 '19 at 21:39
Possible duplicate of [pandas create boolean column using groupby transform](https://stackoverflow.com/questions/48059985/pandas-create-boolean-column-using-groupby-transform) — M_S_N, Jul 27 '19 at 21:40
your data doesn't make much sense, the ID's in the first df and 2nd df don't match? — Umar.H, Jul 27 '19 at 21:49
The IDs in the output do not matter, since the output is a new dataframe. — umagba alex, Jul 27 '19 at 22:03

score 1 · Answer 1 · answered Jul 27 '19 at 21:41

This may give you several of the statistics you are after:

import pandas as pd

df = pd.read_clipboard()  # reads the dataframe copied from the question
df = df.groupby('Author').describe()

Output:

           Id                                               Label                                               
        count      mean       std  min  25%  50%  75%   max count      mean       std  min   25%  50%   75%  max
Author                                                                                                          
Adams     3.0  5.000000  4.358899  2.0  2.5  3.0  6.5  10.0   3.0  0.666667  0.577350  0.0  0.50  1.0  1.00  1.0
James     1.0  5.000000       NaN  5.0  5.0  5.0  5.0   5.0   1.0  1.000000       NaN  1.0  1.00  1.0  1.00  1.0
Jessica   2.0  5.000000  5.656854  1.0  3.0  5.0  7.0   9.0   2.0  0.500000  0.707107  0.0  0.25  0.5  0.75  1.0
Mike      3.0  5.666667  1.527525  4.0  5.0  6.0  6.5   7.0   3.0  0.333333  0.577350  0.0  0.00  0.0  0.50  1.0
Paul      1.0  8.000000       NaN  8.0  8.0  8.0  8.0   8.0   1.0  0.000000       NaN  0.0  0.00  0.0  0.00  0.0
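If only the Label statistics matter, a subset of this summary already covers most of the requested columns. A minimal sketch, assuming the question's data is rebuilt by hand; the `summary` name and the column renames are illustrative and not part of the answer:

```python
import pandas as pd

# Rebuild the question's data (News_post is omitted, it is not needed here).
df = pd.DataFrame({
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# describe() on the Label column alone yields count and mean per author.
summary = df.groupby('Author')['Label'].describe()[['count', 'mean']]
summary = summary.rename(columns={'count': 'Num_Post', 'mean': 'Mean'})
summary['Num_Post'] = summary['Num_Post'].astype(int)

# For a 0/1 label, mean * count equals the number of true labels.
summary['Num_True_Label'] = (summary['Mean'] * summary['Num_Post']).round().astype(int)
summary['Num_False_Label'] = summary['Num_Post'] - summary['Num_True_Label']
print(summary)
```

This relies on Label holding only 0 and 1; with other values the reconstruction of the true count from the mean no longer holds.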

I want code that does exactly what my output shows. Thanks — umagba alex, Jul 27 '19 at 21:50

Gabriela Melo · Answer 2 · 2019-07-27T22:14:27.450

The following will get you what you need:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James', 'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
   ...:                    'News_post': ['xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx', 'xxxxxxxxx'],
   ...:                    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]})

In [3]: # Select Label before summing so the text column News_post is not aggregated
   ...: num_true_label_df = df.groupby(by=['Author'])[['Label']].sum().rename(columns={'Label': 'Num_True_Label'}).reset_index()

In [4]: num_post_df = df.groupby(by=['Author']).count().rename(columns={'News_post': 'Num_Post'})[['Num_Post']].reset_index()

In [5]: df = pd.merge(num_post_df, num_true_label_df, how='left', on='Author').reset_index().rename(columns={'index': 'Id'})

In [6]: df['Id'] = df['Id'] + 1

In [7]: df['Num_False_Label'] = df['Num_Post'] - df['Num_True_Label']

In [8]: df
Out[8]: 
   Id   Author  Num_Post  Num_True_Label  Num_False_Label
0   1    Adams         3               2                1
1   2    James         1               1                0
2   3  Jessica         2               1                1
3   4     Mike         3               1                2
4   5     Paul         1               0                1

Please further specify what your Mean column should represent.

Some resources which might be helpful: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

I tried using the code but I got this error: KeyError: 'Num Post' — umagba alex, Jul 27 '19 at 22:40
From your comment, it seems that you were trying to access column 'Num Post', is that right? From the way I established the dataframe above, the column is actually called 'Num_Post' (notice the underscore between the words) — Gabriela Melo, Jul 27 '19 at 22:50
When is this error happening? (What's the line of code you're trying to run?) — Gabriela Melo, Jul 27 '19 at 23:30

score 1 · Answer 3 · answered Jul 27 '19 at 22:58

Using pandas 0.25 or later, with named aggregation (aggregation relabeling):

df.groupby('Author')['Label'].agg(Num_Post = 'size',
                                  Num_True = 'sum',
                                  Num_False = lambda x: x.eq(0).sum(),
                                  Mean = 'mean')

Output:

         Num_Post  Num_True  Num_False      Mean
Author                                          
Adams           3         2          1  0.666667
James           1         1          0  1.000000
Jessica         2         1          1  0.500000
Mike            3         1          2  0.333333
Paul            1         0          1  0.000000

got this error: TypeError: aggregate() takes at least 2 arguments (1 given) — umagba alex, Jul 27 '19 at 23:30
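This TypeError most likely means the installed pandas predates 0.25, the release that introduced keyword-style (named) aggregation. A sketch of an equivalent that does not rely on named aggregation, assuming the question's data; the `out` name and the renaming step are illustrative:

```python
import pandas as pd

# Rebuild the question's data (only the columns needed for the aggregation).
df = pd.DataFrame({
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Aggregate with a plain list of function names, then rename the columns.
out = df.groupby('Author')['Label'].agg(['size', 'sum', 'mean'])
out.columns = ['Num_Post', 'Num_True', 'Mean']

# With a 0/1 label, the false count is whatever is left after the true count.
out['Num_False'] = out['Num_Post'] - out['Num_True']
out = out[['Num_Post', 'Num_True', 'Num_False', 'Mean']]
print(out)
```

Upgrading pandas is the other option, after which the keyword form in the answer should run unchanged.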

adhg · Accepted Answer · 2019-07-27T23:16:58.220

Use transform to compute the per-author values on every row, then remove the duplicate rows:

df['Num_Post']= df.groupby(['Author'])['Label'].transform('count')
df['Num_True_Label']= df.groupby(['Author'])['Label'].transform('sum')
df['Num_False_Label']= df['Num_Post']-df['Num_True_Label']
df['Mean']= df['Num_Post']/df['Num_True_Label']

Finally, drop the duplicates and remove the News_post column:

df.drop(columns=['News_post'], inplace=True)
df.drop_duplicates(subset='Author', keep='first').sort_values(by=['Author'])

result:

   Id   Author  Label  Num_Post  Num_True_Label  Num_False_Label      Mean
1   2    Adams      1         3               2                1  1.500000
4   5    James      1         1               1                0  1.000000
0   1  Jessica      1         2               1                1  2.000000
3   4     Mike      0         3               1                2  3.000000
7   8     Paul      0         1               0                1       inf

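Note: Mean here is Num_Post divided by Num_True_Label, which gives inf for an author with no true labels. Change it to match your own definition of the mean.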
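If Mean is meant to be the share of an author's posts labelled true, the ratio goes the other way round. A minimal sketch under that assumption (the question leaves Mean as x, so this reading is a guess), using the question's data and an illustrative `result` name:

```python
import pandas as pd

# Rebuild the question's data (News_post is omitted, it is dropped anyway).
df = pd.DataFrame({
    'Id': range(1, 11),
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# transform broadcasts each per-author value back onto every row of that author.
df['Num_Post'] = df.groupby('Author')['Label'].transform('count')
df['Num_True_Label'] = df.groupby('Author')['Label'].transform('sum')
df['Num_False_Label'] = df['Num_Post'] - df['Num_True_Label']

# Assumed definition: fraction of posts with Label == 1, always between 0 and 1.
df['Mean'] = df['Num_True_Label'] / df['Num_Post']

# Keep one row per author and drop the per-post columns.
result = (df.drop(columns=['Id', 'Label'])
            .drop_duplicates(subset='Author')
            .sort_values(by='Author')
            .reset_index(drop=True))
print(result)
```

Dividing by Num_Post avoids the inf seen above, because every author has at least one post.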

Thanks, it works, but the values in the result are not correct. Also, the output does not need the News_post column, because it adds nothing significant to the output. — umagba alex, Jul 27 '19 at 23:11
Just drop the column; see my edit. (You can also drop the Label column with: df.drop(columns=['News_post','Label'], inplace=True)) — adhg, Jul 27 '19 at 23:17

Ayoub ZAROU · Answer 5 · 2019-07-27T23:17:23.367

You could try:

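# Dict-based renaming in SeriesGroupBy.agg was removed in pandas 1.0;
# on pandas 0.25 or later, pass the same mapping as keyword arguments.
agg_df = df.groupby('Author')['Label'].agg(
    Num_post='count',
    Num_True_Label=lambda x: x.eq(1).sum(),
    Num_False_Label=lambda x: x.eq(0).sum(),
    Mean='mean',
).reset_index()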

edited Jul 27 '19 at 23:17

answered Jul 27 '19 at 22:54