Writing custom pandas aggfunc without making all dtypes object

Question

I (think I) need to write a custom aggregation function for the geopandas.GeoDataFrame.dissolve() operation. When merging multiple polygons, I want to keep the information of the polygon with the largest area, that also fulfils other criteria. The operation works fine, but afterwards all attributes of my GeoDataFrame are of dtype object.

The same issue happens with regular pandas groupby(), so I have simplified the example below. Can someone tell me if I should write my custom_sort() differently, to keep the dtypes intact?

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'ints': [1, 2, 3, 4],
    'floats': [1.0, 2.0, 2.2, 3.2],
    'strings': ['foo', 'bar', 'baz', 'qux'],
    'bools': [True, True, True, False],
    'test': ['drop this', 'keep this', 'keep this', 'drop this'],
    })


def custom_sort(df):
    """Define custom aggregation function with special sorting."""
    df = df.sort_values(by=['bools', 'floats'], ascending=False)
    return df.iloc[0]


print(df)
print(df.dtypes)
print()
grouped = df.groupby(by='group').agg(custom_sort)
print(grouped)
print(grouped.dtypes)  # Issue: All dtypes are object
print()
print(grouped.convert_dtypes().dtypes)  # Possible solution, but not for me

# Please note that I cannot use convert_dtypes(). I actually need this for
# geopandas.GeoDataFrame.dissolve() and I think convert_dtypes() messes up
# the geometry information

Output:

  group  ints  floats strings  bools       test
0     A     1     1.0     foo   True  drop this
1     A     2     2.0     bar   True  keep this
2     B     3     2.2     baz   True  keep this
3     B     4     3.2     qux  False  drop this
group       object
ints         int64
floats     float64
strings     object
bools         bool
test        object
dtype: object

      ints floats strings bools       test
group                                     
A        2    2.0     bar  True  keep this
B        3    2.2     baz  True  keep this
ints       object
floats     object
strings    object
bools      object
test       object
dtype: object

ints         Int64
floats     Float64
strings     string
bools      boolean
test        string
dtype: object

rafaelc · Accepted Answer · 2022-06-03T15:36:58.113


The source of the problem is that df.iloc[0] returns a pandas Series. That Series holds one value per column, and those values have different dtypes, so pandas may automatically convert the dtype of the Series to object. If I recall correctly, the exact behavior depends on the pandas version you're working with; it has changed over time.
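
A minimal sketch of that conversion, using a cut-down version of the question's frame (the names `mixed_row` and `numeric_row` are only for illustration). Selecting a single row has to fit all of its values into one Series, so a row that includes a string ends up as object, while a purely numeric row is merely upcast to float64:

```python
import pandas as pd

# Cut-down version of the frame from the question.
df = pd.DataFrame({
    'ints': [1, 2],
    'floats': [1.0, 2.0],
    'strings': ['foo', 'bar'],
})

# One row across int, float and string columns: no common dtype, so object.
mixed_row = df.iloc[0]
print(mixed_row.dtype)

# One row across int and float columns only: upcast to a common numeric dtype.
numeric_row = df[['ints', 'floats']].iloc[0]
print(numeric_row.dtype)
```

When groupby stitches such object rows back together into a frame, every column inherits the object dtype, which matches what the question shows.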

The solution to your problem heavily depends on the operations you're doing in your custom agg function.

In your toy example, I would suggest manipulating your dataframe beforehand, and using the simplest possible aggregating function.

For example, doing the complex sorting up front lets you use a plain 'first' as the aggregation:

(df.sort_values(by=['bools', 'floats'],
                ascending=False)
   .groupby(by='group')
   .agg('first'))
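
To check the effect on dtypes, here is the same sort-then-first chain run end to end against the question's example frame. It is only a sketch on the toy data, and the name `result` is just for illustration:

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'ints': [1, 2, 3, 4],
    'floats': [1.0, 2.0, 2.2, 3.2],
    'strings': ['foo', 'bar', 'baz', 'qux'],
    'bools': [True, True, True, False],
    'test': ['drop this', 'keep this', 'keep this', 'drop this'],
})

# Sort first, so the row we want is the first one in each group,
# then let the built-in 'first' aggregation pick it column by column.
result = (
    df.sort_values(by=['bools', 'floats'], ascending=False)
      .groupby(by='group')
      .agg('first')
)

print(result)
print(result.dtypes)  # numeric and bool columns keep their original dtypes
```

One caveat: 'first' returns the first non-null value of each column, so if the data contains missing values, a result row can combine values from different source rows. In that case something like groupby(...).head(1) after the sort, which returns whole rows, may be a safer choice.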

For what it's worth, I'd also suggest you use a more recent pandas version.


Thank you very much. I actually came up with the same workaround while still considering my question here. I would suggest you change your answer to ".agg('first')", since I assume that is much faster than the lambda. But can I trust that groupby does not somehow mess up the sorting? It has a "sort" argument itself, after all. I will mark this as a solution, but am still curious if there are better ways to define a custom aggfunc. I am on pandas 1.4.1, while 1.4.2 seems to be the most recent. I guess that is fine. – Azrael_DD Jun 03 '22 at 15:34

You are correct that `first` is faster here. I've added your suggestion :). As for whether groupby preserves order, this thread covers it in more depth: https://stackoverflow.com/questions/26456125/python-pandas-is-order-preserved-when-using-groupby-and-agg . – rafaelc Jun 03 '22 at 15:38
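
A small illustration of the ordering question raised in the comments, on toy data with hypothetical column names `group` and `value`. As the pandas documentation describes it, the `sort` argument of groupby only controls the order of the group keys in the result; the order of rows within each group is kept as it was in the input:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['B', 'A', 'B', 'A'],
    'value': [4, 3, 2, 1],
})

# Pre-sort by value, descending, before grouping.
ordered = df.sort_values('value', ascending=False)

# sort=True (the default) sorts the group keys in the result.
keys_sorted = ordered.groupby('group', sort=True)['value'].agg(list)

# sort=False keeps the keys in order of first appearance.
keys_unsorted = ordered.groupby('group', sort=False)['value'].agg(list)

print(keys_sorted)
print(keys_unsorted)
```

In both cases the values inside each group stay in the pre-sorted (descending) order, so taking the first row per group after a sort is not disturbed by the `sort` argument.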
