pandas groupby.apply difference between 0.23.4 and 0.24.2 with deep copy

Question

I noticed strange behaviour after upgrading pandas from 0.23.4 to 0.24.2. The following snippet demonstrates it.

CSV file (filename: my_data_new.csv):

date,name,id,roll,sub_1,sub_2,sub_3
2016-11-30 08:00:00,AAA,A,1,123.456,123.456,123.456
2016-11-30 09:00:00,AAA,A,1,123.457,123.457,123.457
2016-11-30 10:00:00,AAA,A,1,123.458,123.458,123.458
2016-11-30 11:00:00,AAA,A,1,123.459,123.459,123.459
2016-11-30 12:00:00,BBB,B,2,123.451,123.456,123.456
2016-11-30 13:00:00,BBB,B,2,123.452,123.457,123.457
2016-11-30 14:00:00,BBB,B,2,123.453,123.458,123.458
2016-11-30 15:00:00,BBB,B,2,123.454,123.459,123.459

Snippet:

import pandas as pd

print("PANDAS-VERSION:", pd.__version__)

def my_func(d):
    d_copy = d.copy(deep=True)
    return d_copy

data = pd.read_csv("~/my_data_new.csv", parse_dates=['date'], index_col=['date']).sort_index()
result = data.groupby('name').apply(my_func)
print(result)

Output with pandas 0.23.4:

PANDAS-VERSION: 0.23.4
                         name id  roll    sub_1    sub_2    sub_3
name date                                                        
AAA  2016-11-30 08:00:00  AAA  A     1  123.456  123.456  123.456
     2016-11-30 09:00:00  AAA  A     1  123.457  123.457  123.457
     2016-11-30 10:00:00  AAA  A     1  123.458  123.458  123.458
     2016-11-30 11:00:00  AAA  A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00  BBB  B     2  123.451  123.456  123.456
     2016-11-30 13:00:00  BBB  B     2  123.452  123.457  123.457
     2016-11-30 14:00:00  BBB  B     2  123.453  123.458  123.458
     2016-11-30 15:00:00  BBB  B     2  123.454  123.459  123.459

Output with pandas 0.24.2:

PANDAS-VERSION: 0.24.2
                         name id  roll    sub_1    sub_2    sub_3
name date                                                        
AAA  2016-11-30 12:00:00  AAA  A     1  123.456  123.456  123.456
     2016-11-30 13:00:00  AAA  A     1  123.457  123.457  123.457
     2016-11-30 14:00:00  AAA  A     1  123.458  123.458  123.458
     2016-11-30 15:00:00  AAA  A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00  BBB  B     2  123.451  123.456  123.456
     2016-11-30 13:00:00  BBB  B     2  123.452  123.457  123.457
     2016-11-30 14:00:00  BBB  B     2  123.453  123.458  123.458
     2016-11-30 15:00:00  BBB  B     2  123.454  123.459  123.459

My observation is as follows: in pandas 0.24.2, the index of the last group's DataFrame (here 'BBB') is applied to all previous groups' DataFrames (here 'AAA'), whereas in pandas 0.23.4 each group's original index is preserved.

Is this documented behaviour? If so, please point me to the relevant change in the release notes or in the repository code.
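The observation above can be turned into a programmatic check rather than a visual comparison of the printed frames. The following is a minimal sketch: `index_preserved` is a hypothetical helper (not part of pandas), and the CSV is inlined so the snippet runs without the file. It compares the `date` index returned for each group with that group's index in the input.

```python
import io

import pandas as pd

# Same rows as my_data_new.csv, inlined so the check runs without the file.
CSV_TEXT = """date,name,id,roll,sub_1,sub_2,sub_3
2016-11-30 08:00:00,AAA,A,1,123.456,123.456,123.456
2016-11-30 09:00:00,AAA,A,1,123.457,123.457,123.457
2016-11-30 10:00:00,AAA,A,1,123.458,123.458,123.458
2016-11-30 11:00:00,AAA,A,1,123.459,123.459,123.459
2016-11-30 12:00:00,BBB,B,2,123.451,123.456,123.456
2016-11-30 13:00:00,BBB,B,2,123.452,123.457,123.457
2016-11-30 14:00:00,BBB,B,2,123.453,123.458,123.458
2016-11-30 15:00:00,BBB,B,2,123.454,123.459,123.459
"""


def my_func(d):
    return d.copy(deep=True)


def index_preserved(data, result, key="name"):
    """Hypothetical helper: True if every group kept its own index values."""
    for group_name, group in data.groupby(key):
        # Drop the group level and compare what is left with the input index.
        returned_index = result.xs(group_name, level=key).index
        if not returned_index.equals(group.index):
            return False
    return True


data = pd.read_csv(
    io.StringIO(CSV_TEXT), parse_dates=["date"], index_col=["date"]
).sort_index()

# group_keys=True is the default in the versions discussed here; it is spelled
# out so the result carries a (name, date) MultiIndex regardless of defaults.
result = data.groupby("name", group_keys=True).apply(my_func)

print("PANDAS-VERSION:", pd.__version__)
print("index preserved:", index_preserved(data, result))
```

Going by the outputs shown above, the check would be expected to pass on 0.23.4 and fail on 0.24.2, although it has only been reasoned through here and not run on those versions.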

You should really update to the current version (1.1.1); 0.24.2 is many versions behind, and this is not an issue in the current version. Perhaps it's related to [Pandas GroupBy.apply method duplicates first group](https://stackoverflow.com/questions/21390035/). - Trenton McKinney, Sep 25 '20 at 19:09
Thanks for the comment, but my problem is with the index. Why is the index of the last group's DataFrame being copied to the other groups' DataFrames? - Balaji Venkatachalam, Sep 26 '20 at 02:36

Balaji Venkatachalam · Answer 1 · 2020-09-27T07:59:49.827

This problem has already been reported here: https://github.com/pandas-dev/pandas/issues/28652

One more observation: it happens only if the dtype of the index is datetime64[ns]; it does not happen if the dtype is object. For example, with the date column converted before it becomes the index:

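# Reuses pd and my_func from the question's snippet.
data = pd.read_csv("~/my_data_new.csv")
# Convert explicitly so that the index becomes datetime64[ns].
data['date'] = pd.to_datetime(data['date'])
data = data.set_index(['date']).sort_index()
result = data.groupby('name').apply(my_func)
print(result)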

The result of the above will be:

PANDAS-VERSION: 0.24.2
                         name id  roll    sub_1    sub_2    sub_3
name date                                                        
AAA  2016-11-30 12:00:00  AAA  A     1  123.456  123.456  123.456
     2016-11-30 13:00:00  AAA  A     1  123.457  123.457  123.457
     2016-11-30 14:00:00  AAA  A     1  123.458  123.458  123.458
     2016-11-30 15:00:00  AAA  A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00  BBB  B     2  123.451  123.456  123.456
     2016-11-30 13:00:00  BBB  B     2  123.452  123.457  123.457
     2016-11-30 14:00:00  BBB  B     2  123.453  123.458  123.458
     2016-11-30 15:00:00  BBB  B     2  123.454  123.459  123.459
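To see which of the two cases a given frame falls into, the index dtype can be inspected directly. This is a minimal sketch that uses a shortened inline CSV (an assumption made only to keep it self-contained) and `pandas.api.types.is_datetime64_any_dtype`:

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Shortened stand-in for my_data_new.csv (two rows are enough to see the dtype).
CSV_TEXT = """date,name
2016-11-30 08:00:00,AAA
2016-11-30 12:00:00,BBB
"""

# Index built from the raw strings: not a datetime index.
raw = pd.read_csv(io.StringIO(CSV_TEXT)).set_index(["date"])

# Index built after an explicit conversion: a datetime index.
converted = pd.read_csv(io.StringIO(CSV_TEXT))
converted["date"] = pd.to_datetime(converted["date"])
converted = converted.set_index(["date"])

print("raw index dtype:", raw.index.dtype, is_datetime64_any_dtype(raw.index))
print(
    "converted index dtype:",
    converted.index.dtype,
    is_datetime64_any_dtype(converted.index),
)
```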

It does not happen if I do the following:

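# Reuses pd and my_func from the question's snippet.
data = pd.read_csv("~/my_data_new.csv")
# No datetime conversion: the 'date' index keeps the strings as read (object dtype).
data = data.set_index(['date']).sort_index()
result = data.groupby('name').apply(my_func)
print(result)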

The result of the above code will be:

PANDAS-VERSION: 0.24.2
                         name id  roll    sub_1    sub_2    sub_3
name date                                                        
AAA  2016-11-30 08:00:00  AAA  A     1  123.456  123.456  123.456
     2016-11-30 09:00:00  AAA  A     1  123.457  123.457  123.457
     2016-11-30 10:00:00  AAA  A     1  123.458  123.458  123.458
     2016-11-30 11:00:00  AAA  A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00  BBB  B     2  123.451  123.456  123.456
     2016-11-30 13:00:00  BBB  B     2  123.452  123.457  123.457
     2016-11-30 14:00:00  BBB  B     2  123.453  123.458  123.458
     2016-11-30 15:00:00  BBB  B     2  123.454  123.459  123.459
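If upgrading pandas is not an option and the datetime index has to stay, one possible way around the problem is to skip `groupby.apply` and stack the per-group results manually. The sketch below is only an assumption about what might help: `apply_per_group` is a hypothetical helper, and it has not been verified against 0.24.2.

```python
import io

import pandas as pd

# Same rows as my_data_new.csv, inlined so the sketch runs without the file.
CSV_TEXT = """date,name,id,roll,sub_1,sub_2,sub_3
2016-11-30 08:00:00,AAA,A,1,123.456,123.456,123.456
2016-11-30 09:00:00,AAA,A,1,123.457,123.457,123.457
2016-11-30 10:00:00,AAA,A,1,123.458,123.458,123.458
2016-11-30 11:00:00,AAA,A,1,123.459,123.459,123.459
2016-11-30 12:00:00,BBB,B,2,123.451,123.456,123.456
2016-11-30 13:00:00,BBB,B,2,123.452,123.457,123.457
2016-11-30 14:00:00,BBB,B,2,123.453,123.458,123.458
2016-11-30 15:00:00,BBB,B,2,123.454,123.459,123.459
"""


def my_func(d):
    return d.copy(deep=True)


def apply_per_group(data, key, func):
    """Hypothetical helper: call func on each group and stack the results."""
    pieces = {group_name: func(group) for group_name, group in data.groupby(key)}
    # The dict keys become the outer index level; each piece keeps its own index.
    return pd.concat(pieces, names=[key])


data = pd.read_csv(
    io.StringIO(CSV_TEXT), parse_dates=["date"], index_col=["date"]
).sort_index()
result = apply_per_group(data, "name", my_func)
print(result)
```

Because each group is passed to the function as an ordinary DataFrame and the pieces are joined with `pd.concat`, the result has the same outer-key-plus-date index layout as the `apply` output shown above.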
