4

I am trying to aggregate an entire dataframe using pandas, without grouping by anything.

I do need different functions for different columns so I'm using a dictionary, however passing 'first' or 'last' as aggregation functions throws a ValueError: no results, while others such as 'min'/'max'/'mean' give no problem.

This is a simplification of the code.

df = pd.DataFrame({'Col1':[1,2,3,4], 'Col2':[5,6,7,8], 'Col3':[9,10,11,12]})

func = {col: ['first', 'last'] if col in ['Col1']
             else ['first', 'last', 'mean'] if col in ['Col2']
             else 'mean' for col in df.columns}

result = df.agg(func)

Using

result = df.groupby(lambda _ : True).agg(func)

does the job but is quite slow, I assume due to the groupby. The dataframe is already a subset of a larger dataframe that cannot be further grouped.

I have hundreds of columns, I cannot aggregate them individually.

Is there another way to obtain the first and last row, as well as different aggregations, in a faster/more efficient way than grouping?

For a sample dataframe like this

   Col1  Col2  Col3
0     1     5     9
1     2     6    10
2     3     7    11
3     4     8    12

The output should be

      Col1       Col2            Col3
     first last first last mean  mean
True     1    4     5    8  6.5  10.5

Edit: As the original groupby functions would do, no null values/columns should be removed.

1 Answers1

0

Update:

df = pd.DataFrame({'Col1':[1,2,3,4], 'Col2':[5,6,7,8], 'Col3':[9,10,11,12]})
group_1 = ['Col1']
group_2 = ['col2']
func = {col:[fvalue, lvalue] if col in group_1
         else [fvalue, lvalue, 'mean'] if col in group_2
         else 'mean' for col in df.columns}

df.agg(func).unstack().to_frame().dropna().T

Output:

    Col1        Col2  Col3
  fvalue lvalue mean  mean
0    1.0    4.0  6.5  10.5

Let's see if using custom functions without using groupby will help things a little:

def fvalue(x):
    return x.iloc[0]

def lvalue(x):
    return x.iloc[-1]

func = {col:[fvalue, lvalue] if col in group_1
         else [fvalue, lvalue, 'mean'] if col in group_2
         else 'mean' for col in df.columns}

df.agg(func)
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Hi, thank you for your reply. It seems like instead of performing one aggregation it performs 3, so the resulting dataframe has 3 rows (fvalue, lvalue, mean) with nulls or actual values depending on what is stated in the dictionary. The ideal result would be one row with multindex in case of multiple aggregations (e.g. 'Column1':[fvalue, lvalue] becomes two columns, but still one row). – stackoverflowname Dec 16 '19 at 15:07
  • @stackoverflowname Can you add some sample data and expected output to the question. This will make it clear what you desire. – Scott Boston Dec 16 '19 at 15:10
  • I can't share any actual data but l can show what I mean with a random dataframe. I have updated the question accordingly. – stackoverflowname Dec 16 '19 at 15:30
  • @stackoverflowname Use untack() to_frame(). See update – Scott Boston Dec 16 '19 at 15:37
  • Apologies, I previously approved your answer but after testing it I noticed it does remove any null column already present in the data, thus not working for a dataset containing null values like mine. – stackoverflowname Dec 16 '19 at 16:11
  • If this answer helped you would you [upvote and accept](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work?answertab=votes#tab-top) – Scott Boston Dec 16 '19 at 16:18
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/204356/discussion-between-stackoverflowname-and-scott-boston). – stackoverflowname Dec 16 '19 at 16:21