
I try to `groupby` and `agg` but I receive an empty dataframe and no error.

When I do this:

  df_temp = df.groupby('Col1')[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))

then I receive the dataframe aggregated as expected.

When I do this:

  df_temp = df.groupby(['Col1', 'Col2'])[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))

then I receive the dataframe aggregated as expected.

When I do this:

  df_temp = df.groupby(['Col1', 'Col2', 'Col3'])[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))

then I receive the dataframe aggregated as expected.

But when I do this:

  df_temp = df.groupby(['Col1', 'Col2', 'Col3', 'Col4'])[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))

then I receive an empty dataframe and no error.

However, I do not think the problem is `Col4` itself, because when I remove `Col2` but keep `Col4`, I again receive the dataframe aggregated as expected.

Why is this happening?

'Col1', 'Col2', 'Col3' and 'Col4' are of different types, but I do not think this is the problem because, for example, 'Col1', 'Col2' and 'Col3' are also of different types and the aggregation works when I group by only these.

Can it be related to NAs in these columns?

P.S.

I know that it would be better to have specific examples of my data, but it would be too time-consuming to post them here, and I also do not want to expose my data at all.

P.S.2

I did the following. Before the `groupby`, I filled in the `np.nan` with values (e.g. -1 for floats and 'NA' for objects) and the code worked, so my initial hypothesis about the NAs was probably right. Feel free to share ideas why this is happening.
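As a minimal sketch of that workaround (column names and data are hypothetical, chosen only to mimic the described situation where every row has an NA in some key column):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each row has an NA in at least one grouping column,
# so groupby on all four keys would silently return an empty result.
df = pd.DataFrame({
    'Col1': ['a', 'a', 'b'],
    'Col2': [1.0, np.nan, 2.0],
    'Col3': ['x', 'y', np.nan],
    'Col4': [np.nan, 10.0, 20.0],
    'InfoType': ['t1', 't2', 't3'],
})

# Fill NAs in the key columns before grouping:
# -1 for float columns, the string 'NA' for object columns.
for col in ['Col1', 'Col2', 'Col3', 'Col4']:
    fill = -1 if df[col].dtype.kind == 'f' else 'NA'
    df[col] = df[col].fillna(fill)

df_temp = df.groupby(['Col1', 'Col2', 'Col3', 'Col4'])['InfoType'].agg(', '.join)
print(df_temp)  # no longer empty: one group per (filled) key combination
```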

Outcast
  • can you add your input and expected dataframe please? please see [mcve] – Umar.H Jun 10 '20 at 12:33
  • `columns_not_group` can't have any NA values - this must be a list of the column **names**, not the columns itself. You should check how you created `columns_not_group`. – Stef Jun 10 '20 at 15:48
  • @Datanovice, it is a bit time-consuming to do so (without also exposing my data too much). I think that if somebody is experienced in Pandas then he/she may suggest some good hypotheses on why the above is happening (and with no error returned). I suspect it has something to do with NA values in the column values corresponding to columns_not_group, but I may be wrong – Outcast Jun 12 '20 at 12:06
  • @Stef, I meant column values corresponding to columns_not_group - obviously the columns_not_group cannot have any NAs. – Outcast Jun 12 '20 at 12:06
  • @Datanovice, I did the following. Before the `groupby`, I filled in the `np.nan` with values (eg -1 for floats and 'NA' for objects) and the code worked so I was probably right at my initial hypothesis about the NAs. Do you have any idea why this is happening? – Outcast Jun 12 '20 at 13:43
  • @Stef, I did the following. Before the `groupby`, I filled in the `np.nan` with values (eg -1 for floats and 'NA' for objects) and the code worked so I was probably right at my initial hypothesis about the NAs. Do you have any idea why this is happening? – Outcast Jun 12 '20 at 13:44
  • It's still quite difficult to say without a reproduction of your data (even if it's a few rows). If you posted this to GitHub as an issue I'm pretty sure you'd receive the same response; I think the lack of response is also quite evident of the above. Maybe add a bounty, someone more experienced may be able to help, but in the first instance just add a few rows of data that can reproduce your issue. – Umar.H Jun 12 '20 at 13:57
  • @Datanovice, if it is few rows (which still exposes my data though) then you may not encounter the problem at all. I think that someone who is experienced in `pandas` would have encountered something similar and can instantly tell. – Outcast Jun 12 '20 at 13:58
  • you can create dummy data or randomize it – Umar.H Jun 12 '20 at 13:58
  • @Datanovice, if it is dummy then you can simply create them by yourself in the end. And in any case, I think that someone who is experienced in pandas would have encountered something similar and can instantly tell. – Outcast Jun 12 '20 at 13:59
  • @jezrael, do you have any ideas with your very experienced `pandas` mind about why what is described at my post occurs? :) – Outcast Jun 12 '20 at 14:03
  • @Datanovice, answer given and with no data example ;) – Outcast Jun 12 '20 at 14:23
  • my only response is to accept the common community wisdom here [how-to-make-good-reproducible-pandas-examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Umar.H Jun 12 '20 at 14:27

1 Answer


The reason is that every group created by all 4 columns contains at least one NA value. These groups are therefore excluded, and the result is empty. With fewer than 4 columns this condition evidently does not hold for your actual data.

See the docs on missing values:

NA groups in GroupBy are automatically excluded.

Example:

>>> df = pd.DataFrame({'a':[None,1,2], 'b':[1,None,2], 'c': [1,2,None], 'd': [1,1,1]})
>>> df
     a    b    c  d
0  NaN  1.0  1.0  1
1  1.0  NaN  2.0  1
2  2.0  2.0  NaN  1
>>> df.groupby(['a', 'b']).d.sum()
a    b  
2.0  2.0    1
Name: d, dtype: int64
>>> df.groupby(['a', 'c']).d.sum()
a    c  
1.0  2.0    1
Name: d, dtype: int64
>>> df.groupby(['b', 'c']).d.sum()
b    c  
1.0  1.0    1
Name: d, dtype: int64
>>> df.groupby(['a', 'b', 'c']).d.sum()
Series([], Name: d, dtype: int64)

Version 1.1.0 will have a `dropna` parameter in `groupby` to handle this kind of case. You can set it to `False` to include NA values in the groupby keys (the default is `True` for backward compatibility), see https://github.com/pandas-dev/pandas/pull/30584.
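With pandas 1.1.0 or later, the example above can be rerun with `dropna=False` to keep the NA-containing groups (a sketch, assuming pandas >= 1.1.0 is installed):

```python
import pandas as pd

# Same frame as above: every row has a NaN in one of the key columns a, b, c.
df = pd.DataFrame({'a': [None, 1, 2], 'b': [1, None, 2],
                   'c': [1, 2, None], 'd': [1, 1, 1]})

# Default (dropna=True): rows whose keys contain NaN are dropped -> empty result.
empty = df.groupby(['a', 'b', 'c']).d.sum()

# dropna=False treats NaN as a valid group key, so all three rows survive.
kept = df.groupby(['a', 'b', 'c'], dropna=False).d.sum()
print(kept)
```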

Stef
  • Ok, so if this is true then it is basically what I was saying ;) Or, let's say, generally suspecting - I did not explicitly give a specific explanation. – Outcast Jun 12 '20 at 14:11
  • so you'll have to wait for 1.1.0 and for the time being fill your NAs, see my updated answer. – Stef Jun 12 '20 at 14:27
  • Ok, I see, thank you. :) Am I wrong to feel that it is quite "weird" that pandas by default drops every `groupby` row that simply has an NA in one of its key columns (and also that `pandas` does not have an option to change that)? I do not see what the problem is with preserving these rows as they are - either conceptually or technically. – Outcast Jun 12 '20 at 14:31
  • I think this is discussed at length here: https://github.com/pandas-dev/pandas/issues/3729 (this issue was opened in 2013!) – Stef Jun 12 '20 at 14:35
  • Ok, I see, thanks (upvote) ;) - and indeed pandas is a bit too silent about something that is quite major (although, as you mention, it is written in the documentation). – Outcast Jun 12 '20 at 14:40