R count distinct elements based on two columns by group

Question

I have this data basically, but larger:

I want to count a number of distinct combinations of (customer_id, account_id) - that is, distinct or unique values based on two columns, but for each start_date. I can't find the solution anywhere. The result should be another column added to my data.table that should look like this:

That is, for each start_date, it calculates number of distinct values based on both customer_id and account_id.

For example, for start_date equal to 2.2.2018, I have distinct combinations in (customer_id,account_id) being (4,22) (5,38) and (6,13), so I want count to be equal to 3 because I have 3 distinct combinations. I also need the solution to work with character values in customer_id and account_id columns.

Code to replicate the data:

customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))

data <- data.table(customer_id,account_id,start_date)

Please don't post images of data, use `head(dput(...))` to post the actual data. — smci, Aug 16 '18 at 01:41
The R terminology would be *"Add count column showing, for each start_date, count of distinct combinations of (customer_id, account_id)"* — smci, Aug 16 '18 at 01:43
A duplicate, at least for the data.table approach: [Count number of unique rows based on two columns, by group](https://stackoverflow.com/questions/36233318/count-number-of-unique-rows-based-on-two-columns-by-group), using `DT[, count:=uniqueN(.SD) ...` — smci, Aug 16 '18 at 01:48

score 2 · Answer 1 · answered Aug 16 '18 at 00:05

Another dplyr option:

library(dplyr)
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02- 
02","%Y-%m-%d"),each=4))

data <- data.frame(customer_id,account_id,start_date)

data %>%
  group_by(start_date)%>%
  mutate(distinct_values = n_distinct(customer_id, account_id)) %>%
  ungroup()

score 1 · Answer 2 · answered Aug 16 '18 at 00:03

dplyr option

customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02- 
02","%Y-%m-%d"),each=4))

data <- data.frame(customer_id,account_id,start_date)

data %>% 
  group_by(start_date, customer_id, account_id) %>% 
  summarise(Total = 1) %>% 
  group_by(start_date) %>% 
  summarise(Count =n())

Maurits Evers · Accepted Answer · 2018-08-16T00:12:13.990

0

Here is a data.table option

data[, N := uniqueN(paste0(customer_id, account_id, "_")), by = start_date]
#    customer_id account_id start_date N
# 1:           1         11 2017-01-01 4
# 2:           1         11 2017-01-01 4
# 3:           1         11 2017-01-01 4
# 4:           2         11 2017-01-01 4
# 5:           3         55 2017-01-01 4
# 6:           3         88 2017-01-01 4
# 7:           4         22 2018-02-02 3
# 8:           5         38 2018-02-02 3
# 9:           5         38 2018-02-02 3
#10:           6         13 2018-02-02 3

Or

data[, N := uniqueN(.SD, by = c("customer_id", "account_id")), by = start_date]

edited Aug 16 '18 at 00:12

answered Aug 16 '18 at 00:05

Maurits Evers

49,617
4
47
68

Thank you very much, this works great! Can I just ask why "_" is needed in paste0 function? – doremi Aug 16 '18 at 06:07
@doremi In the first solution, we concatenate entries from `customer_id` and `account_id`; separating entries by `"_"` ensures that we can distinguish `"1_11"` from `"11_1"`. Without `"_"` both would become `"111"`. I hope that helps clearing things up:-) – Maurits Evers Aug 16 '18 at 08:50

R count distinct elements based on two columns by group

3 Answers3

Linked