I noticed strange behaviour after upgrading pandas from 0.23.4 to 0.24.2. The following CSV file and snippet demonstrate it.

CSV file (my_data_new.csv):

date,name,id,roll,sub_1,sub_2,sub_3
2016-11-30 08:00:00,AAA,A,1,123.456,123.456,123.456
2016-11-30 09:00:00,AAA,A,1,123.457,123.457,123.457
2016-11-30 10:00:00,AAA,A,1,123.458,123.458,123.458
2016-11-30 11:00:00,AAA,A,1,123.459,123.459,123.459
2016-11-30 12:00:00,BBB,B,2,123.451,123.456,123.456
2016-11-30 13:00:00,BBB,B,2,123.452,123.457,123.457
2016-11-30 14:00:00,BBB,B,2,123.453,123.458,123.458
2016-11-30 15:00:00,BBB,B,2,123.454,123.459,123.459

Snippet:

import pandas as pd

print("PANDAS-VERSION:", pd.__version__)

def my_func(d):
    d_copy = d.copy(deep=True)
    return d_copy

data = pd.read_csv("~/my_data_new.csv", parse_dates=['date'], index_col=['date']).sort_index()
result = data.groupby('name').apply(my_func)
print(result)

Output in pandas 0.23.4:

PANDAS-VERSION: 0.23.4
                         name id  roll    sub_1    sub_2    sub_3
name date                                                        
AAA  2016-11-30 08:00:00  AAA  A     1  123.456  123.456  123.456
     2016-11-30 09:00:00  AAA  A     1  123.457  123.457  123.457
     2016-11-30 10:00:00  AAA  A     1  123.458  123.458  123.458
     2016-11-30 11:00:00  AAA  A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00  BBB  B     2  123.451  123.456  123.456
     2016-11-30 13:00:00  BBB  B     2  123.452  123.457  123.457
     2016-11-30 14:00:00  BBB  B     2  123.453  123.458  123.458
     2016-11-30 15:00:00  BBB  B     2  123.454  123.459  123.459

Output in pandas 0.24.2:

PANDAS-VERSION: 0.24.2
                         name id  roll    sub_1    sub_2    sub_3
name date                                                        
AAA  2016-11-30 12:00:00  AAA  A     1  123.456  123.456  123.456
     2016-11-30 13:00:00  AAA  A     1  123.457  123.457  123.457
     2016-11-30 14:00:00  AAA  A     1  123.458  123.458  123.458
     2016-11-30 15:00:00  AAA  A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00  BBB  B     2  123.451  123.456  123.456
     2016-11-30 13:00:00  BBB  B     2  123.452  123.457  123.457
     2016-11-30 14:00:00  BBB  B     2  123.453  123.458  123.458
     2016-11-30 15:00:00  BBB  B     2  123.454  123.459  123.459

My observation is as follows: in pandas 0.24.2, the index of the last group's DataFrame ('BBB' in this case) is applied to all the previous groups' DataFrames ('AAA' in this case), whereas in pandas 0.23.4 each group keeps its own index.
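The observation above can also be checked programmatically rather than by reading the printed frames. The following is a minimal sketch, not part of the original report: `dates_preserved` is a hypothetical helper name, and the CSV rows are inlined via `io.StringIO` only so that the example is self-contained. It compares the timestamps in the result with those in the input, ignoring row order.

```python
import io

import pandas as pd

# Same rows as my_data_new.csv, inlined so the sketch is self-contained.
CSV_TEXT = """date,name,id,roll,sub_1,sub_2,sub_3
2016-11-30 08:00:00,AAA,A,1,123.456,123.456,123.456
2016-11-30 09:00:00,AAA,A,1,123.457,123.457,123.457
2016-11-30 10:00:00,AAA,A,1,123.458,123.458,123.458
2016-11-30 11:00:00,AAA,A,1,123.459,123.459,123.459
2016-11-30 12:00:00,BBB,B,2,123.451,123.456,123.456
2016-11-30 13:00:00,BBB,B,2,123.452,123.457,123.457
2016-11-30 14:00:00,BBB,B,2,123.453,123.458,123.458
2016-11-30 15:00:00,BBB,B,2,123.454,123.459,123.459
"""


def my_func(d):
    # Same function as in the snippet: return a deep copy of the group.
    return d.copy(deep=True)


def dates_preserved(original, result, level="date"):
    """Hypothetical helper: True if `result` carries the same timestamps as `original`."""
    idx = result.index
    # Depending on the pandas version, apply() may or may not prepend the group key.
    if isinstance(idx, pd.MultiIndex):
        idx = idx.get_level_values(level)
    # Compare sorted values so the row order after grouping does not matter.
    return original.index.sort_values().equals(idx.sort_values())


data = pd.read_csv(
    io.StringIO(CSV_TEXT), parse_dates=["date"], index_col=["date"]
).sort_index()
result = data.groupby("name").apply(my_func)

print("PANDAS-VERSION:", pd.__version__)
print("date index preserved:", dates_preserved(data, result))
```

On a version that shows the behaviour described above, the check would be expected to report that the index is not preserved. Newer pandas versions may emit a deprecation warning for this `apply` call or drop the `name` column from the result, which does not affect the index comparison.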

Is this documented behaviour? If so, please point me to the relevant change in the release notes or in the repository code.

  • You should really update to the current version (1.1.1). This is many versions behind, and this is not an issue in the current version. Perhaps it's related to [Pandas GroupBy.apply method duplicates first group](https://stackoverflow.com/questions/21390035/). – Trenton McKinney Sep 25 '20 at 19:09
  • Thanks for the comment. But my problem is with the Index. Why is Index of the last Group DF being copied to other Group DFs? – Balaji Venkatachalam Sep 26 '20 at 02:36

1 Answer

This problem has already been reported here: https://github.com/pandas-dev/pandas/issues/28652

One more observation: it happens only if the dtype of the index is datetime64[ns]; it does not happen if the dtype is object. For example, converting the date column explicitly still reproduces the problem:

# Reuses `pd` and `my_func` from the question's snippet.
data = pd.read_csv("~/my_data_new.csv")
data['date'] = pd.to_datetime(data['date'])  # index will be datetime64[ns]
data = data.set_index(['date']).sort_index()
result = data.groupby('name').apply(my_func)
print(result)

The result of the above will be:

PANDAS-VERSION: 0.24.2
                         name id  roll    sub_1    sub_2    sub_3
name date                                                        
AAA  2016-11-30 12:00:00  AAA  A     1  123.456  123.456  123.456
     2016-11-30 13:00:00  AAA  A     1  123.457  123.457  123.457
     2016-11-30 14:00:00  AAA  A     1  123.458  123.458  123.458
     2016-11-30 15:00:00  AAA  A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00  BBB  B     2  123.451  123.456  123.456
     2016-11-30 13:00:00  BBB  B     2  123.452  123.457  123.457
     2016-11-30 14:00:00  BBB  B     2  123.453  123.458  123.458
     2016-11-30 15:00:00  BBB  B     2  123.454  123.459  123.459

It does not happen if the date column is left unparsed, so that the index keeps the object dtype:

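# Reuses `pd` and `my_func` from the question's snippet.
data = pd.read_csv("~/my_data_new.csv")  # 'date' is not parsed and stays as strings
data = data.set_index(['date']).sort_index()
result = data.groupby('name').apply(my_func)
print(result)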

The result of the above code will be:

PANDAS-VERSION: 0.24.2
                         name id  roll    sub_1    sub_2    sub_3
name date                                                        
AAA  2016-11-30 08:00:00  AAA  A     1  123.456  123.456  123.456
     2016-11-30 09:00:00  AAA  A     1  123.457  123.457  123.457
     2016-11-30 10:00:00  AAA  A     1  123.458  123.458  123.458
     2016-11-30 11:00:00  AAA  A     1  123.459  123.459  123.459
BBB  2016-11-30 12:00:00  BBB  B     2  123.451  123.456  123.456
     2016-11-30 13:00:00  BBB  B     2  123.452  123.457  123.457
     2016-11-30 14:00:00  BBB  B     2  123.453  123.458  123.458
     2016-11-30 15:00:00  BBB  B     2  123.454  123.459  123.459
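
To see which of the two loading paths above produces a datetime index, the index dtype can be inspected directly. This is a small sketch under stated assumptions: the two-row CSV text is an inlined excerpt of my_data_new.csv, and the names `parsed` and `unparsed` are illustrative only. The exact dtype strings printed may differ between pandas versions, so the sketch relies on `is_datetime64_any_dtype` rather than comparing dtype names.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Two rows from my_data_new.csv, inlined so the sketch is self-contained.
CSV_TEXT = """date,name,id,roll,sub_1,sub_2,sub_3
2016-11-30 08:00:00,AAA,A,1,123.456,123.456,123.456
2016-11-30 12:00:00,BBB,B,2,123.451,123.456,123.456
"""

# Path 1: convert the column with pd.to_datetime before setting the index.
parsed = pd.read_csv(io.StringIO(CSV_TEXT))
parsed["date"] = pd.to_datetime(parsed["date"])
parsed = parsed.set_index(["date"]).sort_index()

# Path 2: leave the column as read, so the index holds plain strings.
unparsed = pd.read_csv(io.StringIO(CSV_TEXT)).set_index(["date"]).sort_index()

for label, frame in (("parsed", parsed), ("unparsed", unparsed)):
    print(label, frame.index.dtype, is_datetime64_any_dtype(frame.index))
```

Only the first path yields a datetime index, which is the case in which the answer reports the problem.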