Count distinct over multiple columns in data.table

Question

I observe users at different times and in different situations, and I may see the same user in the same situation more than once, like so:

library(data.table)
df <- data.table(time = c(1,1,1,2,2),
                 user = c(1,1,2,1,2),
                 situation = c(1,1,1,2,2),
                 observation = c(1,2,1,1,1))

What I would like to do is count the number of distinct user-situation combinations in each time period using data.table. Expected output:

result <- data.table(time = c(1,2),
                     user_situations = c(2,2))

I know I can do this in a chained way:

 unique(df[, .(time, user, situation)])[, .(user_situations = .N), .(time)]

but wonder if there's a simple way to do this in one go.

You might try `df[, .(user_situations = uniqueN(.SD[,.(user, situation)])), time]` but I think your method is more efficient. — Psidom, May 08 '17 at 18:45
Your solution looks fine to me, I would slightly modify to `unique(df, by = c("user","situation"))[, .N, by = time]` — David Arenburg, May 08 '17 at 20:21
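
The one-step variants suggested in these comments can be checked against the chained version from the question. This is only a sketch on the question's sample data. The names `chained`, `one_step` and `via_unique` are illustrative, `.SDcols` is used in place of subsetting `.SD` inside `j`, and `time` is added to `by` in the `unique()` call as an assumption, so that a user-situation pair seen in two time periods is counted in both.

```r
library(data.table)

df <- data.table(time = c(1,1,1,2,2),
                 user = c(1,1,2,1,2),
                 situation = c(1,1,1,2,2),
                 observation = c(1,2,1,1,1))

# Chained version from the question: deduplicate, then count rows per time
chained <- unique(df[, .(time, user, situation)])[, .(user_situations = .N), by = time]

# One step: count distinct (user, situation) rows within each time group
one_step <- df[, .(user_situations = uniqueN(.SD)), by = time,
               .SDcols = c("user", "situation")]

# unique() with by =, including time (an assumption, see the note above)
via_unique <- unique(df, by = c("time", "user", "situation"))[, .(user_situations = .N), by = time]

print(one_step)
```

On this sample all three objects hold the same counts, so the choice is mostly about readability and speed on the real data.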

score 5 · Answer 1 · answered May 08 '17 at 18:56


dplyr solution:

library(data.table)  # needed for data.table()
library(dplyr)
df <- data.table(time = c(1,1,1,2,2),
                 user = c(1,1,2,1,2),
                 situation = c(1,1,1,2,2),
                 observation = c(1,2,1,1,1))

df %>% group_by(time) %>%
  distinct(user, situation) %>%
  summarise(user_situations = n())

# tbl_dt [2 × 2]
   time user_situations
  <dbl>           <int>
1     1               2
2     2               2


user7982431


Yup, in `dplyr` this stuff is pretty easy. Unfortunately I need one for `data.table` – RoyalTS May 08 '17 at 20:02

How is this any easier than the `data.table` solution? This is just plain verbosity – David Arenburg May 08 '17 at 20:05
I'd like to add that `summarise(n_distinct(c(user, situation)))`, which I thought would work, DOESN'T give the correct result. Thanks, your solution worked! – Jas Nov 05 '20 at 08:34
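
A likely reason for the behaviour described in the last comment is that `c(user, situation)` concatenates the two columns into one vector, so `n_distinct()` counts distinct values rather than distinct pairs. The sketch below uses a hypothetical data frame `toy` (not the question's data, where both forms happen to agree) and passes the columns to `n_distinct()` as separate arguments to count combinations.

```r
library(dplyr)

# Hypothetical data: one time period, two users crossed with two situations
toy <- data.frame(time = 1,
                  user = c(1, 2, 1, 2),
                  situation = c(1, 1, 2, 2))

res <- toy %>%
  group_by(time) %>%
  summarise(pooled = n_distinct(c(user, situation)),  # distinct values in the concatenated vector
            pairs  = n_distinct(user, situation))     # distinct (user, situation) combinations

print(res)
```

Here `pooled` is 2 (only the values 1 and 2 occur), while `pairs` is 4.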
