4

Edit: Added defT

Does using pandas.cut change the structure of a pandas.DataFrame.

I am using pandas.cut in the following manner to map single age years to age groups and then aggregating afterwards. However, the aggregation does not work as I end up with NaN in all columns that are being aggregated. Here is my code:

cutoff = numpy.hstack([numpy.array(defT.MinAge[0]),   defT.MaxAge.values])
labels = defT.AgeGrp

df['ageGrp'] = pandas.cut(df.Age, 
                          bins              = cutoff, 
                          labels            = labels, 
                          include_lowest    = True)

Here is defT:

AgeGrp  MaxAge  MinAge
   1      18      14
   2      21      19
   3      24      22
   4      34      25
   5      44      35
   6      54      45
   7      65      55

Then I pass the data-frame into another function to aggregate:

grouped = df.groupby(['Year', 'Month', 'OccID', 'ageGrp', 'Sex', \
                      'Race', 'Hisp', 'Educ'], 
                      as_index = False)

final   = grouped.aggregate(numpy.sum)

If I change the ages to age groups via this manner it works perfectly:

df['ageGrp'] = 1
df.ix[(df.Age >= 14) & (df.Age <= 18), 'ageGrp'] = 1 # Age 16 - 20
df.ix[(df.Age >= 19) & (df.Age <= 21), 'ageGrp'] = 2 # Age 21 - 25  
df.ix[(df.Age >= 22) & (df.Age <= 24), 'ageGrp'] = 3 # Age 26 - 44  
df.ix[(df.Age >= 25) & (df.Age <= 34), 'ageGrp'] = 4 # Age 45 - 64  
df.ix[(df.Age >= 35) & (df.Age <= 44), 'ageGrp'] = 5 # Age 64 - 85  
df.ix[(df.Age >= 45) & (df.Age <= 54), 'ageGrp'] = 6 # Age 64 - 85  
df.ix[(df.Age >= 55) & (df.Age <= 64), 'ageGrp'] = 7 # Age 64 - 85  
df.ix[df.Age >= 65, 'ageGrp'] = 8 # Age 85+

I would prefer to do this on the fly, importing the definition table and using pandas.cut, instead of being hard-coded.

Thank you in advance.

jhoepken
  • 1,842
  • 3
  • 17
  • 24
j riot
  • 544
  • 3
  • 6
  • 16
  • You might want to post a small DataFrame to go with that. Without that, it's a bit of a guess. – Ami Tavory Jul 14 '15 at 21:29
  • Added in the requested data.frame. – j riot Jul 14 '15 at 22:08
  • @jriot can you try `df['ageGrp'].isnull().sum()` to see whether the output is zero or not? I suspect that there are some ages in `age` column that falls outside the bin ranges, so `pd.cut` gives `NaN` to those values and this might cause problems when you do `groupby`. If you work that mannually, `df['ageGrp'] = 1` prevents such situation. – Jianxun Li Jul 14 '15 at 22:12
  • @JianxunLi Yeah, as coded Age >= 65 will mean ageGrp = nan but it sounds like he is saying all are nan? – JohnE Jul 14 '15 at 22:19
  • @JohnE the `NaN` in categorical variable is the only thing I can think of that might cause this strange behaviours (not sure about all `NaNs`). :-) Otherwise, `pd.cut` just give standard cat variables and should have no problem dealing with `groupby`. – Jianxun Li Jul 14 '15 at 22:24
  • A simplified version of the above code works fine for me in pandas 0.16.1 with sample data like `df = pd.DataFrame({ 'Age':np.random.randint(10,90,100) })` (although it will ignore ages above 65, as already noted.) Check your data including all the other variables 'year', 'month', etc. – JohnE Jul 14 '15 at 22:34
  • I tried df['ageGrp'] = 1, which did not solve the problem. I also looked at df['ageGrp'].isnull().sum() which was equal to 0. I will have a look at categorical variables. Thank you for the input. – j riot Jul 14 '15 at 23:56

1 Answers1

2

Here is, perhaps, a work-around.

Consider the following example which replicates the symptom you describe:

import numpy as np
import pandas as pd
np.random.seed(2015)

defT = pd.DataFrame({'AgeGrp': [1, 2, 3, 4, 5, 6, 7],
                     'MaxAge': [18, 21, 24, 34, 44, 54, 65],
                     'MinAge': [14, 19, 22, 25, 35, 45, 55]})

cutoff = np.hstack([np.array(defT['MinAge'][0]), defT['MaxAge'].values])
labels = defT['AgeGrp']

N = 50
df = pd.DataFrame(np.random.randint(100, size=(N,2)), columns=['Age', 'Year'])
df['ageGrp'] = pd.cut(df['Age'], bins=cutoff, labels=labels, include_lowest=True)

grouped = df.groupby(['Year', 'ageGrp'], as_index=False)
final = grouped.agg(np.sum)
print(final)
#              Year  ageGrp  Age
# Year ageGrp                   
# 3    1        NaN     NaN  NaN
#      2        NaN     NaN  NaN
# ...
# 97   1        NaN     NaN  NaN
#      2        NaN     NaN  NaN
# [294 rows x 3 columns]

If we change

grouped = df.groupby(['Year', 'ageGrp'], as_index=False)
final = grouped.agg(np.sum)

to

grouped = df.groupby(['Year', 'ageGrp'], as_index=True)
final = grouped.agg(np.sum).dropna()
print(final)

then we obtain:

             Age
Year ageGrp     
6    7        61
16   4        32
18   1        34
25   3        23
28   5        39
34   7        60
35   5        42
38   4        25
40   2        19
53   7        59
56   4        25
     5        35
66   6        54
67   7        55
70   7        56
73   6        51
80   5        36
81   6        46
85   5        38
90   7        58
97   1        18
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Thanks! This worked perfectly. I am not sure why though, as I did df.dropna() earlier in the process with no avail. – j riot Jul 15 '15 at 16:47