1

I'm trying to count how often an id appears more than once on the same day in my data.

I have a column of dates and column of personal ids.

set.seed(42) 
n     <- 10^5 
ids   <- sample(1:5000, n, replace=T)
dates <- as.Date("1943-07-13", "%Y-%m-%d") - sample(1:9000, n, replace=T)

The solution I have come up with:

sum(table(ids, dates)>1)

The problem is (lack of) efficiency. On a bigger data set I get an error message.

s_baldur
  • 29,441
  • 4
  • 36
  • 69

1 Answers1

2

We can use data.table. Create a data.table with 'ids', and 'dates', get the number of rows grouped by those columns, check whether that is greater than 1, extract that column and get the sum.

sum(data.table(ids, dates)[, .N > 1, .(ids, dates)]$V1)

Benchmarks

system.time(sum(data.table(ids, dates)[, .N > 1, .(ids, dates)]$V1))
#   user  system elapsed 
#  0.05    0.00    0.05 
system.time(sum(table(ids, dates)>1))
#   user  system elapsed 
#   0.87    0.19    1.07 
akrun
  • 874,273
  • 37
  • 540
  • 662