1

I am trying to convert some R code(not written by me) to python code for a project. The R code uses aggregate() to do some grouped sum up, but when I try to replicate the action in Python by using .groupby, the result differs. R code yields dataframe with 479000+ rows whereas Python yields 489000+ rows.

It turns out later, I discovered that R also has this group_by() function and when used on that huge dataframe, the result is the same as what .groupby yields in Python

test <- df %>% group_by(A, B, C) %>% 
  summarise(D= sum(D, na.rm=TRUE), E= sum(E,na.rm=TRUE))


test <- aggregate(x=list(D= df$D, E= df$E), by=list(A= df$A, B= df$B, C=df$C), 
                  FUN=function(x) sum(x, na.rm=TRUE))
test= df.groupby(['A', 'B', 'C'],as_index=False)['D', 'E'].agg('sum')

I am almost surely confident that I didn't mess up on the coding, since the project I work on put emphasis on confidentiality, the above codes is the best I can provide, sorry in advance.

It is obvious that at least one of these approaches are wrong and two functions must operate in different ways to cause that. I would really wish to know the difference between them and which function is correct in achieving what I need(namely). Thank you very much.

jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • 2
    Sounds like an interesting problem, but we would really need some example data to demonstrate the issue. When I tried it on the mtcars dataset I could not reproduce your results – Shinobi_Atobe Jul 11 '19 at 07:24
  • can you show the output of `df %>% distinct(A,B,C) %>% nrow()` – Shinobi_Atobe Jul 11 '19 at 07:29
  • 1
    @Potato_Noodle We don't want all the data anyway. A minimal self-contained [reproducible example](https://stackoverflow.com/a/5963610/6574038) is required. Cheers. – jay.sf Jul 11 '19 at 07:31
  • 4
    The first version with `group_by` includes `NA` as a grouping level for `A B C`, even though it removes `NA`s from the `sum` of `D E`. The `aggregate` version does not include `NA` grouping levels. – Andrew Gustar Jul 11 '19 at 07:36
  • I hate to say this but the problem lies within the previous steps of my code, which has nothing to do with the functions I mentioned in the question. Sorry to disappoint :( . All comments are sincerely appreciated. – Potato_Noodle Jul 11 '19 at 08:01

1 Answers1

1

It is difficult to know thee exact reason, but here is one potential issue with aggregate with na.action. By default, it can remove the rows with NA. An option is to add na.action = na.pass

Using a reproducible example

set.seed(24)
df1 <- data.frame(col1 = rep(letters[1:4], each = 4), 
     col2 =  sample(c(1:5, NA), 16, replace = TRUE), col3 =  NA)

extreme case where one column is all NA

aggregate(.~ col1, df1, sum, na.rm = TRUE)

Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate

aggregate(.~ col1, df1, sum, na.rm = TRUE, na.action = na.pass)
#  col1 col2 col3
#1    a   13    0
#2    b    5    0
#3    c   11    0
#4    d   16    0

So, depending upon the number of NAs anywhere in the columns, that row could be dropped out and this would reflect in the calculation

akrun
  • 874,273
  • 37
  • 540
  • 662