-1

I have been scratching my head trying to solve what seems a simple problem in R. Given the data set called 'DATA' below I want to count the number of incidents of the DATA$ID column and then put those into a new column. So for example the first entry of the new DATA$NEW would be 19 because the ID occurs 19 times. I can't seem to unravel how to do this.

PRAY NOTES        ID DURATION
    1    NA  <NA> 1_MENS_10       60
    2    NA  <NA> 1_MENS_10       60
    3    NA  <NA> 1_MENS_10       60
    4    NA  <NA> 1_MENS_10       60
    5    NA  <NA> 1_MENS_10       60
    6    NA  <NA> 1_MENS_10       60
    7    NA  <NA> 1_MENS_10       60
    8    NA  <NA> 1_MENS_10       60
    9    NA  <NA> 1_MENS_10       60
    10   NA  <NA> 1_MENS_10       60
    11   NA  <NA> 1_MENS_10       60
    12   NA  <NA> 1_MENS_10       60
    13   NA  <NA> 1_MENS_10       60
    14   NA  <NA> 1_MENS_10       60
    15   NA  <NA> 1_MENS_10       60
    16   NA  <NA> 1_MENS_10       60
    17   NA  <NA> 1_MENS_10       60
    18   NA  <NA> 1_MENS_10       60
    19   NA  <NA> 1_MENS_10       60
    20    2  <NA> 1_MENS_14       61
    21    3  <NA> 1_MENS_14       61
    22    2  <NA> 1_MENS_14       61
    23    1  <NA> 1_MENS_14       61
    24    1  <NA> 1_MENS_14       61
    25    3  <NA> 1_MENS_14       61
    26    2  <NA> 1_MENS_14       61
    27    3  <NA> 1_MENS_14       61
    28    1  <NA> 1_MENS_14       61
    29    3  <NA> 1_MENS_14       61
    30    3  <NA> 1_MENS_14       61

Here's a dput:

structure(list(PRAY = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, 3L, 2L, 1L, 1L, 3L, 2L, 
3L, 1L, 3L, 3L), NOTES = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "<NA>", class = "factor"), 
    ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L), .Label = c("1_MENS_10", "1_MENS_14"), class = "factor"), 
    DURATION = c(60L, 60L, 60L, 60L, 60L, 60L, 60L, 60L, 60L, 
    60L, 60L, 60L, 60L, 60L, 60L, 60L, 60L, 60L, 60L, 61L, 61L, 
    61L, 61L, 61L, 61L, 61L, 61L, 61L, 61L, 61L), NEW = c(19L, 
    19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 
    19L, 19L, 19L, 19L, 19L, 19L, 11L, 11L, 11L, 11L, 11L, 11L, 
    11L, 11L, 11L, 11L, 11L)), .Names = c("PRAY", "NOTES", "ID", 
"DURATION", "NEW"), row.names = c(NA, -30L), class = "data.frame")
hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
user1658170
  • 814
  • 2
  • 14
  • 24
  • 1
    Apart from the answers below, you can also achieve it without additional packages. Try e.g. something like `df$new <- with(df,ave(rep(1, nrow(df)), id, FUN=length))`. – coffeinjunky Jun 29 '14 at 11:31

3 Answers3

3

Using data.table package:

library(data.table)
setDT(DATA)[, NEW := .N, by = ID]

DATA
##    PRAY NOTES        ID DURATION NEW
## 1:   NA  <NA> 1_MENS_10       60  19
## 2:   NA  <NA> 1_MENS_10       60  19
## 3:   NA  <NA> 1_MENS_10       60  19
## 4:   NA  <NA> 1_MENS_10       60  19
## 5:   NA  <NA> 1_MENS_10       60  19
## 6:   NA  <NA> 1_MENS_10       60  19
## 7:   NA  <NA> 1_MENS_10       60  19
....

setDT converts data.frame to data.table by reference (meaning, no copy of data is being made) and is therefore quite fast. And then, we aggregate by ID and add a new column NEW containing the counts for that group using .N an in-built special variable.

Note: In v1.9.3, setDF function is now exported, that lets you get back to data.frame, once again, by reference. So, if you'd like to stick to data.frame for some reason, you can just do: setDF(.) on the result.

Arun
  • 116,683
  • 26
  • 284
  • 387
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
2

plyr will do it pretty easily and let you stick with data frames:

library(plyr)
dat <- ddply(dat, .(ID), transform, NEW=length(ID))
hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
1

Here's the dplyr equivalent, to complete the set:

library(dplyr)

DATA <- DATA %>% group_by(ID) %>% mutate(ID_Counts = n())

head(DATA)
#Source: local data frame [6 x 6]
#Groups: ID
#
#  PRAY NOTES        ID DURATION NEW ID_Counts
#1   NA  <NA> 1_MENS_10       60  19        19
#2   NA  <NA> 1_MENS_10       60  19        19
#3   NA  <NA> 1_MENS_10       60  19        19
#4   NA  <NA> 1_MENS_10       60  19        19
#5   NA  <NA> 1_MENS_10       60  19        19
#6   NA  <NA> 1_MENS_10       60  19        19
talat
  • 68,970
  • 21
  • 126
  • 157