
I have a data set which contains a number of unique identifiers for each date, e.g.

df <- data.frame(date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02")),
                 ids = c(3, 4, 1, 3))

I'd then like to summarise this information to get the number of new unique ids that appear on each date. For example, on January 1 there are two unique ids (3 and 4). But on January 2, there is only one new unique id (1). So the resulting data frame should look like:

date          n_new_unique_ids
2016-01-01                   2
2016-01-02                   1

Is this possible with dplyr? I had a look at lag, but a fixed lag size doesn't make sense in this context. Or perhaps it's possible with another package?

jkeirstead

2 Answers


One option would be to remove all the duplicated 'ids' from the dataset (the output below reflects an earlier version of the question's example data):

df %>%
  filter(!(duplicated(ids)|duplicated(ids, fromLast=TRUE)))
#        date ids
#1 2016-01-01   2
#2 2016-01-02   3

Update

Using the updated data

library(dplyr)

df %>%
  arrange(date, ids) %>%
  filter(!duplicated(ids)) %>%
  group_by(date) %>%
  summarise(n_new_unique_ids = n())
#         date n_new_unique_ids
#       <date>            <int>
#1 2016-01-01                 2
#2 2016-01-02                 1
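A more compact variant of the same idea (a sketch, assuming dplyr >= 0.8 for the `name` argument of `count()`): `distinct()` with `.keep_all = TRUE` keeps only the first row for each id, and `count()` then tallies those first appearances per date.

```r
library(dplyr)

df <- data.frame(date = as.Date(c("2016-01-01", "2016-01-01",
                                  "2016-01-02", "2016-01-02")),
                 ids = c(3, 4, 1, 3))

df %>%
  arrange(date) %>%                       # earliest date first, so distinct() keeps it
  distinct(ids, .keep_all = TRUE) %>%     # keep the first occurrence of each id
  count(date, name = "n_new_unique_ids")  # tally first appearances per date
```

Because the data is sorted by date before `distinct()`, each id is credited to the date on which it first appears, which is exactly the "new unique ids" count the question asks for.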
akrun

One way is to use aggregate. We can keep only the rows containing the first occurrence of each id and then count them per date.

aggregate(ids ~ date, df[!duplicated(df$ids), ], length)
#         date ids
#1 2016-01-01   2
#2 2016-01-02   1
Ronak Shah