
I have a data set which contains a number of unique identifiers for each date, e.g.

df <- data.frame(date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02")),
                 ids = c(3, 4, 1, 3))

I'd then like to summarise this information to get the number of new unique ids that appear on each date. For example, on January 1 there are two unique ids (3 and 4). But on January 2, there is only one new unique id (1). So the resulting data frame should look like:

date          n_new_unique_ids
2016-01-01                   2
2016-01-02                   1

Is this possible with dplyr? I had a look at lag, but a fixed lag size doesn't make sense in this context. Or perhaps it's possible with another package?

jkeirstead

2 Answers


One option would be to remove all the duplicated 'ids' from the dataset (the output below reflects an earlier version of the question's example data):

df %>%
  filter(!(duplicated(ids)|duplicated(ids, fromLast=TRUE)))
#        date ids
#1 2016-01-01   2
#2 2016-01-02   3

Update

Using the updated data

library(dplyr)

df %>%
  arrange(date, ids) %>%
  filter(!duplicated(ids)) %>%
  group_by(date) %>%
  summarise(n_new_unique_ids = n())
#         date n_new_unique_ids
#       <date>            <int>
#1 2016-01-01                 2
#2 2016-01-02                 1
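A more compact variant of the same idea (a sketch, assuming dplyr >= 0.8 for the `name` argument of `count()`): `distinct()` with `.keep_all = TRUE` keeps only the first row for each id, and `count()` then tallies those first appearances per date.

```r
library(dplyr)

df <- data.frame(date = as.Date(c("2016-01-01", "2016-01-01",
                                  "2016-01-02", "2016-01-02")),
                 ids = c(3, 4, 1, 3))

df %>%
  arrange(date) %>%                       # earliest date first, so distinct() keeps it
  distinct(ids, .keep_all = TRUE) %>%     # keep the first occurrence of each id
  count(date, name = "n_new_unique_ids")  # tally first appearances per date
```

Because the data is sorted by date before `distinct()`, each id is credited to the date on which it first appears, which is exactly the "new unique ids" count the question asks for.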
akrun

One way is to use aggregate. We can keep only the rows containing the first occurrence of each id and then count them per date.

aggregate(ids ~ date, df[!duplicated(df$ids), ], length)
#         date ids
#1 2016-01-01   2
#2 2016-01-02   1
Ronak Shah