I have the following dataset:

df<-structure(list(IDFAM = c("2010 7599 2996 1", "2010 7599 3071 1", 
"2010 7599 3071 1", "2010 7599 3660 1", "2010 7599 4736 1", "2010 7599 6235 1", 
"2010 7599 6299 1", "2010 7599 9903 1", "2010 7599 11013 1", 
"2010 7599 11778 1", "2010 7599 11778 1", "2010 7599 12248 1", 
"2010 7599 13127 1", "2010 7599 14261 1", "2010 7599 16280 1", 
"2010 7599 16280 1", "2010 7599 16280 1", "2010 7599 16280 1", 
"2010 7599 16280 1", "2010 7599 17382 1"), AGED = c(45L, 47L, 
24L, 46L, 46L, 44L, 43L, 43L, 43L, 16L, 43L, 46L, 44L, 47L, 43L, 
16L, 20L, 18L, 18L, 43L)), .Names = c("IDFAM", "AGED"), row.names = c("5614", 
"5748", "5753", "6864", "8894", "11761", "11884", "18738", "20896", 
"22351", "22353", "23267", "24939", "27072", "30946", "30947", 
"30949", "30950", "30952", "33034"), class = "data.frame")

I would like to number the observations that share the same IDFAM value from 1 to n, where n is the number of observations with that IDFAM value, and store the result in a new ID column. This would give the following table:

IDFAM              AGED     ID
2010 7599 2996 1    45       1
2010 7599 3071 1    47       1
2010 7599 3071 1    24       2
2010 7599 3660 1    46       1
2010 7599 4736 1    46       1
2010 7599 6235 1    44       1
2010 7599 6299 1    43       1
2010 7599 9903 1    43       1
2010 7599 11013 1   43       1
2010 7599 11778 1   16       1
2010 7599 11778 1   43       2
2010 7599 12248 1   46       1
2010 7599 13127 1   44       1
2010 7599 14261 1   47       1
2010 7599 16280 1   43       1
2010 7599 16280 1   16       2
2010 7599 16280 1   20       3
2010 7599 16280 1   18       4
2010 7599 16280 1   18       5
2010 7599 17382 1   43       1

How can I do this? Thanks.

user2568648

2 Answers

There are several ways.

In base R, use ave:

with(df, ave(rep(1, nrow(df)), IDFAM, FUN = seq_along))
#  [1] 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2 3 4 5 1
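
To keep the counter in the data frame rather than just printing it, the same idea can be assigned to a new column. A minimal sketch on a small made-up data frame (toy and its values are assumptions for illustration, not the question's data); it passes an integer vector to ave, so the result stays integer:

```r
# Toy data (hypothetical values, not the question's data)
toy <- data.frame(IDFAM = c("a", "b", "b", "c", "b"),
                  AGED  = c(45L, 47L, 24L, 46L, 16L),
                  stringsAsFactors = FALSE)

# Number the rows within each IDFAM group, in order of appearance
toy$ID <- ave(seq_along(toy$IDFAM), toy$IDFAM, FUN = seq_along)

toy$ID
# [1] 1 1 2 1 3
```

The rows do not need to be sorted by IDFAM first: ave puts each value back in its original position.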

With the "data.table" package, use sequence(.N):

library(data.table)
DT <- as.data.table(df)
DT[, ID := sequence(.N), by = IDFAM]
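
As a possible alternative, data.table also provides rowid(), which numbers rows within each distinct value of its argument, so no by is needed. A short sketch on made-up data (the DT below and its values are assumptions for illustration):

```r
library(data.table)

# Toy data (hypothetical values, not the question's data)
DT <- data.table(IDFAM = c("a", "b", "b", "c", "b"),
                 AGED  = c(45L, 47L, 24L, 46L, 16L))

# rowid() counts occurrences of each IDFAM value in order of appearance;
# := adds the column by reference
DT[, ID := rowid(IDFAM)]

DT$ID
# [1] 1 1 2 1 3
```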

With the "dplyr" package, try:

library(dplyr)
df %>% group_by(IDFAM) %>% mutate(count = sequence(n()))

or (as recommended by Hadley in the comments):

df %>% group_by(IDFAM) %>% mutate(count = row_number(IDFAM))
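
Within a grouped mutate, row_number() can also be called with no argument, which simply numbers the rows of each group in their current order. A minimal sketch on made-up data (toy, res and the ID column name are assumptions for illustration):

```r
library(dplyr)

# Toy data (hypothetical values, not the question's data)
toy <- data.frame(IDFAM = c("a", "b", "b", "c", "b"),
                  AGED  = c(45L, 47L, 24L, 46L, 16L),
                  stringsAsFactors = FALSE)

res <- toy %>%
  group_by(IDFAM) %>%          # one group per IDFAM value
  mutate(ID = row_number()) %>% # 1..n within each group
  ungroup()                     # drop the grouping afterwards

res$ID
# [1] 1 1 2 1 3
```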

Update

Because this is asked for relatively frequently, I added a function for it (getanID) to my "splitstackshape" package. It is based on the "data.table" approach above.

library(splitstackshape)
getanID(df, id.vars = "IDFAM")
#                 IDFAM AGED .id
#  1:  2010 7599 2996 1   45   1
#  2:  2010 7599 3071 1   47   1
#  3:  2010 7599 3071 1   24   2
#  4:  2010 7599 3660 1   46   1
#  5:  2010 7599 4736 1   46   1
#  6:  2010 7599 6235 1   44   1
#  7:  2010 7599 6299 1   43   1
#  8:  2010 7599 9903 1   43   1
#  9: 2010 7599 11013 1   43   1
# 10: 2010 7599 11778 1   16   1
# 11: 2010 7599 11778 1   43   2
# 12: 2010 7599 12248 1   46   1
# 13: 2010 7599 13127 1   44   1
# 14: 2010 7599 14261 1   47   1
# 15: 2010 7599 16280 1   43   1
# 16: 2010 7599 16280 1   16   2
# 17: 2010 7599 16280 1   20   3
# 18: 2010 7599 16280 1   18   4
# 19: 2010 7599 16280 1   18   5
# 20: 2010 7599 17382 1   43   1
Gregor Thomas
A5C1D2H2I1M1N2O1R2T1

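With dplyr 0.5 you can use the group_indices function, which returns an integer identifying the group each row belongs to (one value per distinct IDFAM, rather than a counter within each group). Although it does not support mutate, the following approach is straightforward: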

df$id <- df %>% group_indices(IDFAM)
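
For comparison, a group index and a within-group counter can be reproduced side by side in base R. This is only a sketch: match() on unique() is used here as a stand-in for group_indices (an assumption for illustration; it numbers groups by first appearance, which may differ from dplyr's numbering), and fam is made-up data.

```r
# Toy grouping variable (hypothetical values)
fam <- c("a", "b", "b", "c", "b")

# Group index: one integer per distinct value, repeated for every member
group_id <- match(fam, unique(fam))

# Within-group counter: restarts at 1 for each distinct value
within_id <- ave(seq_along(fam), fam, FUN = seq_along)

group_id
# [1] 1 2 2 3 2
within_id
# [1] 1 1 2 1 3
```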
Rodrigo Remedio