1

Is there a way to do a pandas groupby aggregate on a dataframe and returning a certain string from a column? I have a dataframe like so:

lst = [[ 100, 'buicks', .001, np.nan, np.nan], [101, 'chevy', .002, np.nan, np.nan],
       [102, 'olds', .003, .006, np.nan], [100, 'buick', np.nan, .004, np.nan],
       [103, 'nissan', np.nan, np.nan, .1], [103,'nissans', np.nan, .14, np.nan]]
df = pd.DataFrame(lst, columns=['car_id', 'name', 'aa', 'bb', 'cc'])
   car_id     name     aa     bb   cc
0     100   buicks  0.001    NaN  NaN
1     101    chevy  0.002    NaN  NaN
2     102     olds  0.003  0.006  NaN
3     100    buick    NaN  0.004  NaN
4     103   nissan    NaN    NaN  0.1
5     103  nissans    NaN  0.140  NaN

Need this:

0     100   buicks  0.001  0.004  NaN
1     101    chevy  0.002    NaN  NaN
2     102     olds  0.003  0.006  NaN
4     103  nissans    NaN  0.140  0.1

What I am trying to do is groupby the car_id column and then sum columns aa, bb, cc. However, the name column values may not be the same but I need to keep one of them. I don't care which one. I was looking at: Pandas sum by groupby, but exclude certain columns and ended up with something like this:

df.groupby('car_id').agg({'aa': np.sum, 'bb': np.sum, 'cc':np.sum})

But this is dropping the name column. I assume that I can add the name column to the above statement and there is an operation I can put in there to return the string.

Thanks

othnin
  • 73
  • 1
  • 11

4 Answers4

3

Yes, indeed you can use first for the name column:

df.groupby('car_id').agg({'name':'first',
                          'aa':'sum',
                          'bb':'sum',
                          'cc':'sum'})

Output:

          name     aa     bb   cc
car_id                           
100     buicks  0.001  0.004  0.0
101      chevy  0.002  0.000  0.0
102       olds  0.003  0.006  0.0
103     nissan  0.000  0.140  0.1
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
3

Here with lambda in case you want NaN in there:

df.groupby('car_id').agg({'name':'first', 'aa':'sum', 'bb':'sum', 'cc': lambda x: x.sum(skipna=False)})
MGs
  • 290
  • 1
  • 3
  • 16
0

To meet your output

s=df.groupby(['car_id'])[['aa','bb','cc']].sum(min_count=1)
s['name']=df.drop_duplicates('car_id').set_index('car_id').name
s
Out[185]: 
           aa     bb   cc    name
car_id                           
100     0.001  0.004  NaN  buicks
101     0.002    NaN  NaN   chevy
102     0.003  0.006  NaN    olds
103       NaN  0.140  0.1  nissan
BENY
  • 317,841
  • 20
  • 164
  • 234
0
pd.concat([df.groupby('car_id').apply(lambda x: x.head(1))[['car_id','name']].set_index('car_id'), \
df.groupby('car_id').agg({'aa': np.sum, 'bb': np.sum, 'cc':np.sum}).reset_index().set_index('car_id')], axis = 1).reset_index()


    car_id  name    aa  bb  cc
0   100 buicks  0.001   0.004   0.0
1   101 chevy   0.002   0.000   0.0
2   102 olds    0.003   0.006   0.0
3   103 nissan  0.000   0.140   0.1
theDBA
  • 239
  • 1
  • 5