
I've seen a pandasql query like this:

import pandas as pd
from pandasql import sqldf
df = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 5]})
sqldf('select * from df group by A', locals())

This gives:

   A  B
0  1  3
1  2  6

I find it really weird to have a group by without an aggregate function, but can anyone tell me which function is used on the aggregated columns to reduce multiple values into one?

zoran119
  • By ANSI standards, that SQL statement is incorrect, as non-aggregated columns must appear in the `GROUP BY` clause. MySQL is the only dialect I know of that allows this, with its only_full_group_by mode set to off. Maybe this module uses that mode and/or dialect. And no single function reduces the multiple values, as `GROUP BY` builds distinct groupings over which aggregations are run. No doubt the inspiration for pandas' `groupby()`. – Parfait Mar 20 '17 at 12:20
  • Where did the 6 (B1) come from? – Andrew L Mar 20 '17 at 12:21
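
To see where this behavior likely comes from, here is a minimal sketch using Python's sqlite3 module directly (this assumes pandasql's default SQLite backend). SQLite accepts bare columns in a GROUP BY query and returns each bare column's value from one arbitrary row of the group, which in practice is often the last row scanned:

import sqlite3

# build the same table the question uses
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE df (A INTEGER, B INTEGER)')
conn.executemany('INSERT INTO df VALUES (?, ?)', [(1, 3), (2, 4), (2, 5)])

# bare column B with GROUP BY: SQLite picks B from one unspecified row per group,
# typically the last one, so this usually prints [(1, 3), (2, 5)]
print(conn.execute('SELECT * FROM df GROUP BY A').fetchall())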

2 Answers


It looks like the groupby method you're looking for is last():

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 5]})
df.groupby('A', as_index=False).last()

Output:

   A  B
0  1  3
1  2  5

I'm saying this assuming the 6 in your output was a typo (see my comment above) and was meant to be 5.

Andrew L
  • I'm sure the `.last()` function is mentioned elsewhere, but I found it here, and it solved a question I've had for a while: if you want to do a `groupby()`, say with a `MultiIndex`, how do you just return the values that you want? – Evan Oct 18 '18 at 18:37
  • Where did 4 go in column="B"? – haneulkim Mar 07 '19 at 15:47
  • @haneulkim *Where did 4 go in column="B"?* Since it's grouping by `A`, `4` and `5` get grouped with `2`, but only `5` is kept because of the `last()` function. If you use `first()`, then it would keep the `4` instead. – wisbucky Jan 20 '22 at 00:00
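
A quick sketch of the first()/last() distinction described in the comment above, using the question's DataFrame (an illustration, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 5]})

# last() keeps the final B within each A group, so A=2 pairs with B=5
print(df.groupby('A', as_index=False).last())

# first() keeps the initial B within each A group, so A=2 pairs with B=4
print(df.groupby('A', as_index=False).first())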

The groupby() operation creates "groupby" objects that you can work with. The groups themselves are stored as key/value pairs. Follow this example:

#%%
import pandas as pd

# make a dataframe
df = pd.DataFrame([['foo', 'baz', 1, 9.3],
                   ['foo', 'baz', 1, 9.4],
                   ['foo', 'baz', 3, 9.5],
                   ['bar', 'bash', 5, 1.7],
                   ['bar', 'bash', 10, 1.8],
                   ['bar', 'junk', 11, 2.3]],
                  columns=['col_a', 'col_b', 'col_c', 'col_d'])
print(df)

#%% create a groupby object; 
#  use two dimension columns & one measure column from the original df
gr = df[['col_a', 'col_b', 'col_c']].groupby(by=['col_a', 'col_b'])
print(type(gr))

#%% typical aggregate method of a groupby object
gr_sum = gr.agg({'col_c':'sum'})
print(gr_sum)

#%% look at the groups within the groupby object (dict)
print(gr.groups.keys())

#%% now that you have the keys, you can work with specific groups
print(gr.get_group(('bar', 'bash')))

#%% or you can create a dataframe of only the keys 
# (to finally answer the OP's question)
df_no_agg = pd.DataFrame(gr.groups.keys())
print(df_no_agg)

#%% simplify & put it all together in one line of code
pd.DataFrame(df.groupby(by=['col_a', 'col_b']).groups.keys())
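
As a follow-up sketch (not part of the original answer): the same groupby object can also be iterated directly, which is another way to work with each group without aggregating:

#%% iterate over the groupby object: each item is a (key, sub-DataFrame) pair
for key, sub_df in df.groupby(by=['col_a', 'col_b']):
    print(key)
    print(sub_df)
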
Data-phile