8

I am new to R and I have a data.frame called "CT" with a column called "ID" that contains several hundred different identification numbers (these are patients). Most numbers appear once, but some appear two or three times (and therefore in different rows). In the CT data.frame, I would like to add a new variable called "countID" giving the number of occurrences of each patient (patients with multiple records should still appear on several rows). I tried two different strategies after reading this forum. 1st strategy:

CT <- cbind(CT, countID=sequence(rle(CT.long$ID)$lengths))

But this doesn't work; I get only one count. 2nd strategy: create a data frame with two columns (one is ID, one is count) and then match this data frame with CT:

tabs <- table(CT.long$ID)
out <- data.frame(item=names(unlist(tabs)),count=unlist(tabs)[],stringsAsFactors=FALSE)
rownames(out) = c()
head(out)

# item    count
# 1 1.312     1
# 2 1.313     2
# 3 1.316     1
# 4 1.317     1
# 5 1.321     1
# 6 1.322     1

So this works fine, but I can't merge the two data.frames: the number of rows doesn't match between "out" and "CT" (out has fewer rows, of course). Maybe someone has an elegant solution to add the number of occurrences directly to the data.frame CT, or to correctly match the two data.frames?

Cœur
den
  • +1 for showing input and expected output, but next time you post, make your example [**reproducible**](http://stackoverflow.com/q/5963269/1478381) by including some data. Welcome to SO! – Simon O'Hanlon May 24 '13 at 13:59

3 Answers

7

You were almost there! rle will work very nicely; you just need to sort the IDs before computing rle, because rle counts runs of consecutive equal values rather than total occurrences:

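CT <- data.frame( value = runif(10) , id = sample(5, 10, replace = TRUE) )

#  sort on ID when calculating rle, so that equal IDs form a single run
Count <- rle( sort( CT$id ) )

#  match each row's id to the run values and take the corresponding run length
CT$Count <- Count$lengths[ match( CT$id , Count$values ) ]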
CT
#       value id Count
#1  0.94282600  1     4
#2  0.12170165  2     2
#3  0.04143461  1     4
#4  0.76334609  3     2
#5  0.87320740  4     1
#6  0.89766749  1     4
#7  0.16539820  1     4
#8  0.98521044  5     1
#9  0.70609853  3     2
#10 0.75134208  2     2
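
To make the role of the sort concrete, here is a minimal sketch on a small made-up vector (the `id` values are hypothetical, not the asker's data). It shows that rle on unsorted IDs only sees runs of length 1, and it also includes `ave()` as one possible base R alternative that needs no sorting:

```r
# Hypothetical toy IDs, chosen so that no two equal values are adjacent
id <- c(1, 2, 1, 3, 1, 2)

# rle() on the unsorted vector: every run has length 1
runs_unsorted <- rle(id)
runs_unsorted$lengths

# After sorting, equal IDs are adjacent, so run lengths are total counts
runs_sorted <- rle(sort(id))
runs_sorted$values
runs_sorted$lengths

# Map the totals back onto the original row order
count_rle <- runs_sorted$lengths[match(id, runs_sorted$values)]

# Alternative sketch: ave() applies length() within each group of equal IDs
count_ave <- ave(id, id, FUN = length)
```

Both `count_rle` and `count_ave` have one entry per element of `id`, so either can be assigned directly as a new column.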
Simon O'Hanlon
4

data.table usually provides the quickest way:

set.seed(3)
library(data.table)
ct <- data.table(id=sample(1:10,15,replace=TRUE),item=round(rnorm(15),3))
# := adds the column by reference, so ct itself is modified
ct[, countid := .N, by = id]
ct
    id   item countid
 1:  2  0.953       2
 2:  9  0.535       2
 3:  4 -0.584       2
 4:  4 -2.161       2
 5:  7 -1.320       3
 6:  7  0.810       3
 7:  2  1.342       2
 8:  3  0.693       1
 9:  6 -0.323       5
10:  7 -0.117       3
11:  6 -0.423       5
12:  6 -0.835       5
13:  6 -0.815       5
14:  6  0.794       5
15:  9  0.178       2
statquant
3

If you don't feel the need to use base R, plyr makes this task easy:

> set.seed(3)
> library(plyr)
> ct <- data.frame(id=sample(1:10,15,replace=TRUE),item=round(rnorm(15),3))
> ct <- ddply(ct,.(id),transform,idcount=length(id))
> head(ct)
  id   item idcount
1  2  0.953       2
2  2  1.342       2
3  3  0.693       1
4  4 -0.584       2
5  4 -2.161       2
6  6 -0.323       5
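
Note that the plyr output above comes back ordered by id. If the original row order matters, the question's second strategy (a count table matched back to CT) can be finished in base R. The following is only a sketch under assumptions: the data are made up, and the column names simply follow the question.

```r
# Hypothetical example data; names CT, ID and countID follow the question
CT <- data.frame(ID = c("1.313", "1.312", "1.313", "1.316"),
                 value = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

# One count per distinct ID (fewer entries than rows in CT)
tabs <- table(CT$ID)

# Look the table up by name once per row, so the result has nrow(CT) entries
CT$countID <- as.vector(tabs[as.character(CT$ID)])
CT
```

Indexing the table by the ID of each row sidesteps the row-count mismatch between the summary and CT, and the rows of CT stay in their original order.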
David