2

I have data.frame which I want to group my columns in dplyr and data.table packages.

So this is my data.frame:

df <- data.frame(OwnerUserId = sample(c(-1, 2:10), size = 10),
                  TypeOfId = sample(c(1,2), size = 10, replace = T))

df

## OwnerUserId TypeOfId
## 1           10        1
## 2            6        2
## 3            7        2
## 4            3        2
## 5            2        2
## 6            9        2
## 7           -1        2
## 8            5        1
## 9            8        2
## 10           4        1

I want to group with the use of length function and have a result like this:

new <- aggregate(x = df[, "OwnerUserId"],
                 by = df["OwnerUserId"], 
                 FUN = length)

##    OwnerUserId x
## 1           -1 1
## 2            2 1
## 3            3 1
## 4            4 1
## 5            5 1
## 6            6 1
## 7            7 1
## 8            8 1
## 9            9 1
## 10          10 1

I did

new <- df %>% group_by(OwnerUserId) %>% summarise(AnswersNumber = length(df[, "OwnerUserId"]))

And I got

##    OwnerUserId AnswersNumber
##          <dbl>         <int>
##  1          -1            10
##  2           2            10
##  3           3            10
##  4           4            10
##  5           5            10
##  6           6            10
##  7           7            10
##  8           8            10
##  9           9            10
## 10          10            10

And it isn't the same result.

Also I tried to do it in data.table, but I get an errors.

Maybe there are some other functions that are analogous to length function in aggregate()?

lizsav
  • 75
  • 4

1 Answers1

4

We can use n() in dplyr

library(dplyr)
new <- df %>%
    group_by(OwnerUserId) %>% 
    summarise(AnswersNumber = n())

and .N in data.table

library(data.table)
setDT(df)[, .(AnswersNumber = .N), OwnerUserId]

The df[, "OwnerUserId"] returns the entire column breaking the grouping created with 'OwnerUserId', thus we get the length of the whole column instead of the length of the group. We could instead use summarise(AnswerNumber = length(OwnerUserId)) as well. But, n() is specialized for this

akrun
  • 874,273
  • 37
  • 540
  • 662