Counting duplicates in R

Question

I would like to know how it would be possible to make a new variable counting how many ID duplicates I have for certain years. For example, below I want to count for the year 2014 how many times before that year that ID was repeated. That way in the year 2015 it is counting the ID's in both 2013 and 2014.

ID   Term   Year    Repeats
122   L      2013     N/A
112   L      2013     N/A   
002   L      2013     N/A   
152   L      2013     N/A
124   L      2013     N/A
122   L      2014     1
102   L      2014     N/A
142   L      2014     N/A
152   L      2014     N/A
120   L      2014     N/A
198   L      2014     N/A
122   L      2015     2
012   L      2015     N/A
101   L      2015     N/A
092   L      2015     N/A
031   L      2015     N/A

It's better to `dput` the example data or provide the R code that creates it, so that people who would like to help you can quickly copy your data into their own R session — Hack-R, Jul 12 '16 at 17:11

score 2 · Answer 1 · answered Jul 12 '16 at 17:34

If Year is in ascending order:

df$Repeats <- 0L
i <- which(duplicated(df$ID))
df$Repeats[i] <- with(df[i, ], unsplit(lapply(split(ID, ID), seq_along), ID))
df
#    ID Term Year Repeats
#1  122    L 2013       0
#2  112    L 2013       0
#3    2    L 2013       0
#4  152    L 2013       0
#5  124    L 2013       0
#6  122    L 2014       1
#7  102    L 2014       0
#8  142    L 2014       0
#9  152    L 2014       1
#10 120    L 2014       0
#11 198    L 2014       0
#12 122    L 2015       2
#13  12    L 2015       0
#14 101    L 2015       0
#15  92    L 2015       0
#16  31    L 2015       0

Jaap · Answer 2 · 2016-07-12T19:13:12.433

Another base R solution:

d$Repeats <- ave(d$ID, d$ID, FUN = function(x) seq_along(x)-1)

# or a bit cleaner (thx to @DavidArenburg):
d$Repeats <- with(d, ave(ID, ID, FUN = seq_along)) - 1

which gives:

> d
    ID Term Year Repeats
1  122    L 2013       0
2  112    L 2013       0
3    2    L 2013       0
4  152    L 2013       0
5  124    L 2013       0
6  122    L 2014       1
7  102    L 2014       0
8  142    L 2014       0
9  152    L 2014       1
10 120    L 2014       0
11 198    L 2014       0
12 122    L 2015       2
13  12    L 2015       0
14 101    L 2015       0
15  92    L 2015       0
16  31    L 2015       0

A solution using data.table:

library(data.table)
setDT(d, key = c('ID','Year'))
d[, Repeats := 0:(.N-1), by = ID]

which gives:

> d
     ID Term Year Repeats
 1:   2    L 2013       0
 2:  12    L 2015       0
 3:  31    L 2015       0
 4:  92    L 2015       0
 5: 101    L 2015       0
 6: 102    L 2014       0
 7: 112    L 2013       0
 8: 120    L 2014       0
 9: 122    L 2013       0
10: 122    L 2014       1
11: 122    L 2015       2
12: 124    L 2013       0
13: 142    L 2014       0
14: 152    L 2013       0
15: 152    L 2014       1
16: 198    L 2014       0

Alternatively, you can use the rowid function from the development version of data.table:

d[, Repeats := rowid(ID)-1]

With dplyr:

library(dplyr)
d %>% group_by(ID) %>% mutate(Repeats = row_number()-1)

If you want NA's instead of zero's, you could use:

d[, Repeats := c(NA, 1:(.N-1)), by = ID]

which will give:

    ID Term Year Repeats
 1:   2    L 2013      NA
 2:  12    L 2015      NA
 3:  31    L 2015      NA
 4:  92    L 2015      NA
 5: 101    L 2015      NA
 6: 102    L 2014      NA
 7: 112    L 2013      NA
 8: 120    L 2014      NA
 9: 122    L 2013      NA
10: 122    L 2014       1
11: 122    L 2015       2
12: 124    L 2013      NA
13: 142    L 2014      NA
14: 152    L 2013      NA
15: 152    L 2014       1
16: 198    L 2014      NA

Counting duplicates in R

2 Answers2