Aggregating with length function in dplyr and data.table packages

Question

I have data.frame which I want to group my columns in dplyr and data.table packages.

So this is my data.frame:

df <- data.frame(OwnerUserId = sample(c(-1, 2:10), size = 10),
                  TypeOfId = sample(c(1,2), size = 10, replace = T))

df

## OwnerUserId TypeOfId
## 1           10        1
## 2            6        2
## 3            7        2
## 4            3        2
## 5            2        2
## 6            9        2
## 7           -1        2
## 8            5        1
## 9            8        2
## 10           4        1

I want to group with the use of length function and have a result like this:

new <- aggregate(x = df[, "OwnerUserId"],
                 by = df["OwnerUserId"], 
                 FUN = length)

##    OwnerUserId x
## 1           -1 1
## 2            2 1
## 3            3 1
## 4            4 1
## 5            5 1
## 6            6 1
## 7            7 1
## 8            8 1
## 9            9 1
## 10          10 1

I did

new <- df %>% group_by(OwnerUserId) %>% summarise(AnswersNumber = length(df[, "OwnerUserId"]))

And I got

##    OwnerUserId AnswersNumber
##          <dbl>         <int>
##  1          -1            10
##  2           2            10
##  3           3            10
##  4           4            10
##  5           5            10
##  6           6            10
##  7           7            10
##  8           8            10
##  9           9            10
## 10          10            10

And it isn't the same result.

Also I tried to do it in data.table, but I get an errors.

Maybe there are some other functions that are analogous to length function in aggregate()?

score 4 · Accepted Answer · answered May 08 '21 at 18:22

We can use n() in dplyr

library(dplyr)
new <- df %>%
    group_by(OwnerUserId) %>% 
    summarise(AnswersNumber = n())

and .N in data.table

library(data.table)
setDT(df)[, .(AnswersNumber = .N), OwnerUserId]

The df[, "OwnerUserId"] returns the entire column breaking the grouping created with 'OwnerUserId', thus we get the length of the whole column instead of the length of the group. We could instead use summarise(AnswerNumber = length(OwnerUserId)) as well. But, n() is specialized for this

Hi and thanks for your solution! It really help me – lizsav May 09 '21 at 12:56 — lizsav, May 09 '21 at 12:56

Aggregating with length function in dplyr and data.table packages

1 Answers1