
I have a pandas DataFrameGroupBy object which I would like to convert back to a normal DataFrame. I know that I can use:

df_g.apply(lambda x: x)

But why is that needed, considering that `apply` calls are typically expensive in pandas? Is there a better solution? I don't see any eye-watering performance penalty for my test case (since I don't have many columns), so it might be fine. Just curious :)

Sample code:

import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4,5], 'b': ["a", "a", "a", "b", "b"]})
df_g = df.groupby(by='b')
df_again = df_g.apply(lambda x: x)

Regards, Niklas

  • There is no straightforward way, because you always *want* to do something after grouping. What's the purpose of it otherwise? – yatu Apr 21 '20 at 13:38
  • In my case it's in a compute API service, where we chain pandas operations. In almost all cases we do run an operation on the group, but in a few cases we don't. Now we can either handle these corner cases by changing the chaining to understand that we shouldn't run the group step, or allow it to group and then simply cast it back for the edge cases :) – Niklas B Apr 22 '20 at 10:07

1 Answer


First, it's worth reconsidering why we group at all if we don't do anything with the groups. That said, we can try `pd.concat`, which should be faster than `apply`:

pd.concat(dict(iter(df_g)).values())

   a  b
0  1  a
1  2  a
2  3  a
3  4  b
4  5  b

%timeit pd.concat(dict(iter(df_g)).values())
#3.09 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df_g.apply(lambda x: x)
#5.33 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
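As a side note (my own sketch, not part of the timings above): iterating a GroupBy directly yields `(key, sub-DataFrame)` pairs, so the `dict(...).values()` detour can be skipped and the groups passed straight to `pd.concat`:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': ["a", "a", "a", "b", "b"]})
df_g = df.groupby(by='b')

# Each iteration step yields (group key, sub-DataFrame);
# concatenating the sub-frames reassembles the original rows.
df_again = pd.concat(g for _, g in df_g)

# With the default sort=True, groups come back in sorted key order.
# Here the key column is already sorted, so this round-trips exactly:
print(df_again.equals(df))  # → True
```

Note that if the group keys are not contiguous and sorted in the original frame, the concatenated result will have its rows reordered by group; `df_again.sort_index()` would restore the original row order in that case.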
anky