
I've seen a pandasql query like this:

import pandas as pd
from pandasql import sqldf
df = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 5]})
sqldf('select * from df group by A', locals())

This gives:

   A  B
0  1  3
1  2  6

I find it really weird to have a group by without an aggregate function, but can anyone tell me which function is used on the aggregated columns to reduce multiple values into one?

zoran119
  • By ANSI standards, that SQL statement is incorrect, as non-aggregated columns must appear in the `GROUP BY` clause. MySQL is the only dialect I know of that allows this, with its only_full_group_by mode set to off. Maybe this module uses that mode and/or dialect. And no single function reduces the multiple values, as `GROUP BY` builds distinct groupings over which aggregations are run. No doubt the inspiration for pandas' `groupby()`. – Parfait Mar 20 '17 at 12:20
  • Where did the 6 (B1) come from? – Andrew L Mar 20 '17 at 12:21
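
To see where this behavior likely comes from, here is a minimal sketch using Python's sqlite3 module directly (this assumes pandasql's default SQLite backend). SQLite accepts bare columns in a GROUP BY query and returns each bare column's value from one arbitrary row of the group, which in practice is often the last row scanned:

import sqlite3

# build the same table the question uses
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE df (A INTEGER, B INTEGER)')
conn.executemany('INSERT INTO df VALUES (?, ?)', [(1, 3), (2, 4), (2, 5)])

# bare column B with GROUP BY: SQLite picks B from one unspecified row per group,
# typically the last one, so this usually prints [(1, 3), (2, 5)]
print(conn.execute('SELECT * FROM df GROUP BY A').fetchall())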

2 Answers


It looks like the groupby method you're looking for is last():

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 5]})
df.groupby('A', as_index=False).last()

Output:

   A  B
0  1  3
1  2  5

I'm saying this assuming the 6 in your output was a typo (see my comment above) and was meant to be 5.

Andrew L
  • I'm sure the `.last()` function is mentioned elsewhere, but I found it here, and it solved a question I've had for a while: if you want to do a `groupby()`, say with a `MultiIndex`, how do you just return the values that you want? – Evan Oct 18 '18 at 18:37
  • Where did 4 go in column="B"? – haneulkim Mar 07 '19 at 15:47
  • @haneulkim *Where did 4 go in column="B"?* Since it's grouping by `A`, `4` and `5` get grouped with `2`, but only `5` is kept because of the `last()` function. If you use `first()`, then it would keep the `4` instead. – wisbucky Jan 20 '22 at 00:00
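
A quick sketch of the first()/last() distinction described in the comment above, using the question's DataFrame (an illustration, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 5]})

# last() keeps the final B within each A group, so A=2 pairs with B=5
print(df.groupby('A', as_index=False).last())

# first() keeps the initial B within each A group, so A=2 pairs with B=4
print(df.groupby('A', as_index=False).first())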

The groupby() operation creates "groupby" objects that you can work with. The groups themselves are stored as key/value pairs. Follow this example:

#%%
import pandas as pd

# make a dataframe
df = pd.DataFrame([['foo', 'baz', 1, 9.3],
                   ['foo', 'baz', 1, 9.4],
                   ['foo', 'baz', 3, 9.5],
                   ['bar', 'bash', 5, 1.7],
                   ['bar', 'bash', 10, 1.8],
                   ['bar', 'junk', 11, 2.3]],
                  columns=['col_a', 'col_b', 'col_c', 'col_d'])
print(df)

#%% create a groupby object; 
#  use two dimension columns & one measure column from the original df
gr = df[['col_a', 'col_b', 'col_c']].groupby(by=['col_a', 'col_b'])
print(type(gr))

#%% typical aggregate method of a groupby object
gr_sum = gr.agg({'col_c':'sum'})
print(gr_sum)

#%% look at the groups within the groupby object (dict)
print(gr.groups.keys())

#%% now that you have the keys, you can work with specific groups
print(gr.get_group(('bar', 'bash')))

#%% or you can create a dataframe of only the keys 
# (to finally answer the OP's question)
df_no_agg = pd.DataFrame(gr.groups.keys())
print(df_no_agg)

#%% simplify & put it all together in one line of code
pd.DataFrame(df.groupby(by=['col_a', 'col_b']).groups.keys())
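
As a follow-up sketch (not part of the original answer): the same groupby object can also be iterated directly, which is another way to work with each group without aggregating:

#%% iterate over the groupby object: each item is a (key, sub-DataFrame) pair
for key, sub_df in df.groupby(by=['col_a', 'col_b']):
    print(key)
    print(sub_df)
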
Data-phile