Pandas groupby is duplicating groups when using apply twice

Question

Can a function passed to pandas groupby.apply(func) itself call .apply() without duplicating and overwriting data?

In a way, the use of .apply() is nested.

Environment: Python 3.7.3, pandas==0.25.1

import pandas as pd


def dummy_func_nested(row):
    row['new_col_2'] = row['value'] * -1
    return row


def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)

    return df_group


def pandas_groupby():
    # initialize data
    df = pd.DataFrame([
        {'country': 'US', 'value': 100.00, 'id': 'a'},
        {'country': 'US', 'value': 95.00, 'id': 'b'},
        {'country': 'CA', 'value': 56.00, 'id': 'y'},
        {'country': 'CA', 'value': 40.00, 'id': 'z'},
    ])

    # group by country and apply first dummy_func
    new_df = df.groupby('country').apply(dummy_func)

    # new_df and df should have the same list of countries
    assert new_df['country'].tolist() == df['country'].tolist()
    print(new_df)


if __name__ == '__main__':
    pandas_groupby()

The above code should return

  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  b      None      -95.0
2      CA   56.0  y      None      -56.0
3      CA   40.0  z      None      -40.0

However, the code returns

  country  value id new_col_1  new_col_2
0      US  100.0  a      None     -100.0
1      US   95.0  a      None      -95.0
2      US   56.0  a      None      -56.0
3      US   40.0  a      None      -40.0

This behavior only appears to happen when both groups have an equal number of rows. If one group has more rows than the other, the output is as expected.

U13-Forward · Answer 1 · 2020-01-24T03:00:50.597

2

A quote from the documentation:

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
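To see how often a function is actually invoked, here is a minimal sketch that records every call made by DataFrame.apply. The names `seen` and `record_and_negate` are made up for illustration, and the number of calls depends on the pandas version: versions that behave as the quote describes invoke the function one extra time on the first row, while others call it once per row.

```python
import pandas as pd

seen = []  # index label of every row the function is called with


def record_and_negate(row):
    # Side effect: remember which row we were called with
    seen.append(row.name)
    return row['value'] * -1


df = pd.DataFrame({'value': [100.0, 95.0]}, index=['a', 'b'])
result = df.apply(record_and_negate, axis=1)

# If 'a' appears twice in `seen`, the first row was evaluated twice
print(seen)
print(result.tolist())
```

If the function mutates and returns the object it receives, as dummy_func_nested does with row, any extra call repeats that mutation.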

Try changing this part of your code:

def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = df_group.apply(dummy_func_nested, axis=1)

    return df_group

To:

def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = dummy_func_nested(df_group)

    return df_group

You don't need the apply.

Of course, the more efficient way would be:

df['new_col_1'] = None
df['new_col_2'] = -df['value']
print(df)

Or:

print(df.assign(new_col_1=None, new_col_2=-df['value']))

edited Jan 24 '20 at 03:00

answered Jan 24 '20 at 02:49


I remember reading that in the documentation, and immediately forgetting it, good catch! – Kaj Jan 24 '20 at 03:05
using `dummy_func_nested(df_group)` instead of `df_group.apply(dummy_func_nested, axis=1)` causes this output https://gist.github.com/olehdubno/0bc25b11c1efe3dd83b955a39f2422f7 The output shows the original groups, however the rows are still overwritten by duplicates of the other group. – Oleh Dubno Jan 24 '20 at 03:14

I ran the code and seem to get your desired output, can't seem to reproduce your output. What are you doing to get that output? – Kaj Jan 24 '20 at 03:53
Using `dummy_func_nested()` independent of `apply()`, i.e. `df_group = dummy_func_nested(df_group)`, produces the desired results. Not sure what I ran that caused the frame to display in a weird way. My understanding is that we should not call `apply()` inside functions that are themselves passed to a groupby `apply()`; we should call the function directly instead. – Oleh Dubno Jan 24 '20 at 19:16
When using `groupby` we should avoid using `apply()` methods inside of functions that use `apply()`. – Oleh Dubno Jan 24 '20 at 19:20

@OlehDubno please accept and upvote if it works :-) – U13-Forward Jan 25 '20 at 02:48

score 0 · Answer 2 · answered Jan 24 '20 at 19:24

When using groupby, we should avoid calling apply() inside functions that are themselves passed to groupby.apply().

The correct code that produces desired results is below.

Disclaimer: the code could be written more efficiently. The purpose is to demonstrate that we should avoid calling apply() methods inside of groupby.apply(). It has adverse effects when the groups it is applied to have an equal number of rows. If the groups differ in size, everything goes smoothly.

Shoutout to user: u10-forward

import pandas as pd


def dummy_func_nested(df):
    df['new_col_2'] = df['value'] * -1
    return df


def dummy_func(df_group):
    df_group['new_col_1'] = None

    # apply dummy_func_nested
    df_group = dummy_func_nested(df_group)

    return df_group


def pandas_groupby():
    # initialize data
    df = pd.DataFrame([
        {'country': 'US', 'value': 100.00, 'id': 'a'},
        {'country': 'US', 'value': 95.00, 'id': 'b'},
        {'country': 'CA', 'value': 56.00, 'id': 'y'},
        {'country': 'CA', 'value': 40.00, 'id': 'z'},
    ])

    # group by country and apply first dummy_func
    new_df = df.groupby('country').apply(dummy_func)

    # new_df and df should have the same list of countries
    assert new_df['country'].tolist() == df['country'].tolist()
    print(new_df)


if __name__ == '__main__':
    pandas_groupby()

That said, I still think it is a bug that apply() methods cannot be called safely inside of groupby.apply().
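For per-group logic that should not depend on how apply() treats mutation, one possible alternative is to loop over the groups explicitly and build each result with assign(), which returns a new frame instead of modifying the group in place. This is only a sketch: `add_columns` and `pieces` are illustrative names, and it assumes the original row order can be restored by sorting on the index.

```python
import pandas as pd


def add_columns(df_group):
    # assign() returns a new DataFrame, so the group passed in is not mutated
    return df_group.assign(new_col_1=None, new_col_2=-df_group['value'])


df = pd.DataFrame([
    {'country': 'US', 'value': 100.00, 'id': 'a'},
    {'country': 'US', 'value': 95.00, 'id': 'b'},
    {'country': 'CA', 'value': 56.00, 'id': 'y'},
    {'country': 'CA', 'value': 40.00, 'id': 'z'},
])

# Iterating over the groupby yields (key, sub-frame) pairs, sorted by key
pieces = [add_columns(group) for _, group in df.groupby('country')]

# Concatenate the per-group results and restore the original row order
new_df = pd.concat(pieces).sort_index()
print(new_df)
```

Because each group is handled by a plain function call, no apply() is nested inside another.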
