Example

library(data.table)

set.seed(2016)
dt <- data.table(
  Grp = sample(1000, 1000000, replace=TRUE),
  Date = as.Date("2016-1-1") + sample(365, 1000000, replace=TRUE)
)
dt <- unique(dt)

dt[order(Grp, Date)]
         Grp       Date
     1:    1 2016-01-02
     2:    1 2016-01-03
     3:    1 2016-01-05
     4:    1 2016-01-06
     5:    1 2016-01-07
    ---                
341526: 1000 2016-12-27
341527: 1000 2016-12-28
341528: 1000 2016-12-29
341529: 1000 2016-12-30
341530: 1000 2016-12-31

How do I know which (if any) groups share the exact same set of dates? I guess I can dcast the data and then search for matching rows, but is there a better way?

Ben
  • Using dcast and then duplicated on the date columns should do it. – Haboryme Sep 10 '16 at 06:21
  • That's what I did and it worked. I'm just wondering if there's a more efficient way of doing this, in case the cardinality of Date were very high. – Ben Sep 10 '16 at 06:23
  • http://stackoverflow.com/questions/19392332/find-all-duplicated-records-in-data-table-not-all-but-one The answers provided here might interest you then. – Haboryme Sep 10 '16 at 06:40
  • Your example contains zero instances of the feature you're asking about: `dcast(dt, Grp ~ Date, fun=function(x) length(x) > 0)[, uniqueN(.SD), .SDcols=!"Grp"]` Not sure what you have in mind to do with such a result anyways. – Frank Sep 10 '16 at 08:48
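For reference, the dcast-then-duplicated approach discussed in the comments might look roughly like the sketch below. The `toy` table is made-up data (not the question's `dt`), constructed so that groups 1 and 3 share the same dates; `wide`, `date_cols` and `dup` are illustrative names only.

```r
library(data.table)

# Hypothetical toy data: groups 1 and 3 have the same set of dates
toy <- data.table(
  Grp  = c(1L, 1L, 2L, 2L, 3L, 3L, 4L),
  Date = as.Date("2016-01-01") + c(0, 1, 0, 2, 1, 0, 0)
)

# One row per group, one column per date; each cell counts occurrences (0 or 1 here)
wide <- dcast(toy, Grp ~ Date, value.var = "Date", fun.aggregate = length)

# Flag every row whose date columns match those of some other row
date_cols <- setdiff(names(wide), "Grp")
dup <- duplicated(wide, by = date_cols) |
  duplicated(wide, by = date_cols, fromLast = TRUE)

matching_grps <- wide[dup, Grp]
```

The wide table has one column per distinct date, which is why this gets expensive when the cardinality of Date is very high.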

1 Answer

As @Frank points out, your example data doesn't actually contain any groups that share the same set of dates.

I think your best bet is probably something like the following:

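# Collect each group's sorted dates into a list column, plus the number of dates
dt_agg = dt[order(Date), .(dates = list(Date), count = .N), by = Grp]
setkey(dt_agg, count)

# Within each count, keep the groups only if at least two have identical date vectors
dt_agg[ , if (.N > 1L && any(duplicated(setDT(transpose(dates))))) .SD,
        by = count]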

The key efficiency gain (I think) is to skip the comparison entirely unless two Grps have the same number of unique Dates in the first place (hence grouping by count).

Within each count, I'm a bit less confident about the best way to figure out which Grps contain set-identical Dates. (Also, the code as written doesn't exactly pull out which Grps are identical, but I don't think that's a far stretch from what's presented here.)
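One possible way to pull out the matching Grps explicitly is to collapse each group's sorted dates into a single signature string and then group on that string. This is only a sketch on made-up toy data, and it swaps the transpose/duplicated step for a string comparison; `toy`, `sigs`, `key_str`, `set_id` and `shared` are all illustrative names, not anything from the code above.

```r
library(data.table)

# Hypothetical toy data: groups 1 and 3 have the same set of dates
toy <- data.table(
  Grp  = c(1L, 1L, 2L, 2L, 3L, 3L, 4L),
  Date = as.Date("2016-01-01") + c(0, 1, 0, 2, 1, 0, 0)
)

# One signature per group: its sorted, de-duplicated dates pasted together
sigs <- toy[, .(key_str = paste(sort(unique(as.integer(Date))), collapse = ",")),
            by = Grp]

# Number the distinct signatures, then keep those shared by more than one group
sigs[, set_id := .GRP, by = key_str]
shared <- sigs[, if (.N > 1L) .(Grp = Grp), by = set_id]
```

Here `shared` has one row per Grp that belongs to a shared date set, with `set_id` identifying which Grps go together. Grouping on the full signature makes the count pre-filter unnecessary, since groups with different numbers of dates can never produce the same string; whether this is faster than comparing within count on large data is something I haven't measured.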

Anyway, this code ran almost instantly on my machine, so it's hard to evaluate its efficiency.

MichaelChirico