
Grouping by multiple columns with missing data:

import pandas as pd

data = [['Falcon', 'Captive', 390], ['Falcon', None, 350],
        ['Parrot', 'Captive', 30], ['Parrot', 'Wild', 20]]
df = pd.DataFrame(data, columns=['Animal', 'Type', 'Max Speed'])

I understand how missing data are dealt with when grouping by individual columns (groupby columns with NaN (missing) values), but I do not understand the behaviour when grouping by two columns.

It seems I cannot loop over all groups, even though they all seem to be identified:

groupeddf = df.groupby(['Animal', 'Type'])
counter = 0
for group in groupeddf:
    counter = counter + 1
print(counter)
print(len(groupeddf.groups))

results in 3 and 4, which is inconsistent.

Pandas version 1.0.3

wpm
2 Answers


To loop over all groups in pandas 1.0 you'll need to convert the NoneType objects to strings.

df = df.astype(str)  # or just df['Type'] = df['Type'].astype(str)

Then you'll get four iterations of your loop.
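Putting the workaround together as a runnable sketch (using the sample data from the question): converting the column to strings turns None into the literal string 'None', which groupby no longer treats as missing.

```python
import pandas as pd

# Sample data from the question; None marks the missing Type.
data = [['Falcon', 'Captive', 390], ['Falcon', None, 350],
        ['Parrot', 'Captive', 30], ['Parrot', 'Wild', 20]]
df = pd.DataFrame(data, columns=['Animal', 'Type', 'Max Speed'])

# Convert the grouping column to strings: None becomes the string 'None',
# so the row is no longer dropped from the grouping.
df['Type'] = df['Type'].astype(str)

groups = list(df.groupby(['Animal', 'Type']))
print(len(groups))  # 4 - the ('Falcon', 'None') group is now included
```

Note the trade-off: the missing value is now an ordinary string 'None', so any later code that checks for NaN (e.g. isna()) will no longer detect it.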

According to the docs:

NA and NaT group handling

If there are any NaN or NaT values in the grouping key, these will be automatically excluded. In other words, there will never be an “NA group” or “NaT group”. This was not the case in older versions of pandas, but users were generally discarding the NA group anyway (and supporting it was an implementation headache).

Or you could upgrade to the development version, pandas 1.1, where this issue appears to be fixed by the option dropna=False.
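On pandas 1.1 or later, that option makes the loop see all four groups without touching the data; a minimal sketch with the question's sample frame:

```python
import pandas as pd  # requires pandas >= 1.1 for dropna=False

data = [['Falcon', 'Captive', 390], ['Falcon', None, 350],
        ['Parrot', 'Captive', 30], ['Parrot', 'Wild', 20]]
df = pd.DataFrame(data, columns=['Animal', 'Type', 'Max Speed'])

# dropna=False keeps NaN grouping keys, so iteration now yields
# all four groups, including ('Falcon', nan).
counter = 0
for key, grp in df.groupby(['Animal', 'Type'], dropna=False):
    counter = counter + 1
print(counter)  # 4
```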

C8H10N4O2

In the post concerning groupby columns with NaN (missing) values there is a sentence: NA groups in GroupBy are automatically excluded.

Apparently, when grouping by multiple columns, the same occurs if any level of the grouping key contains NaN.

To confirm it, run:

for key, grp in groupeddf:
    print(f'\nGroup: {key}\n{grp}')

and the result will be:

Group: ('Falcon', 'Captive')
   Animal     Type  Max Speed
0  Falcon  Captive        390

Group: ('Parrot', 'Captive')
   Animal     Type  Max Speed
2  Parrot  Captive         30

Group: ('Parrot', 'Wild')
   Animal  Type  Max Speed
3  Parrot  Wild         20

But if you print the content of groupeddf.groups, you will get:

{('Falcon', 'Captive'): Int64Index([0], dtype='int64'),
 ('Falcon', nan): Int64Index([1], dtype='int64'),
 ('Parrot', 'Captive'): Int64Index([2], dtype='int64'),
 ('Parrot', 'Wild'): Int64Index([3], dtype='int64')}

So we have group ('Falcon', nan), containing row with index 1.

If you want to process all groups, without any tricks to change NaN into something else, run something like:

for key in groupeddf.groups:
    print(f'\nGroup: {key}\n{df.loc[groupeddf.groups[key]]}')

This time the printout will also contain the previously missing group.

Valdi_Bo
  • So, it is excluded, but not really excluded? I understand the workaround(s), but not the DataFrameGroupBy object. – wpm May 27 '20 at 12:23
  • The above exclusion pertains only to how "iterator access" (*for* loop) works on the *GroupBy* object itself. As I just found, it does **not** pertain to iterator access to *GroupBy.groups*. – Valdi_Bo May 27 '20 at 12:31