11

I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.

rank observations in group

I have long data (many rows per person, one row per person-observation) and I basically want a variable, that tells me how often the person has been observed already.

I have the first two columns and want the third one:

person  wave   obs
pers1   1999   1
pers1   2000   2
pers1   2003   3
pers2   1998   1
pers2   2001   2

Now I'm using two loop-approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries didn't really help me yet (hard to phrase the problem).

Thanks for any pointers!

# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs$PERSNR,person.obs$wave) , ]

person.obs$n.obs = 0

# first approach: loop through people and assign range
unp = unique(person.obs$PERSNR)
unplength = length(unp)
for(i in 1:unplength) {
   print(unp[i])
   person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs = 
1:length(person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs)
    i=i+1
   gc()
}

# second approach: loop through rows and reset counter at new person
pnr = 0
for(i in 1:length(person.obs[,2])) {
  if(pnr!=person.obs[i,]$PERSNR) { pnr = person.obs[i,]$PERSNR
  e = 0
  }
  e=e+1
  person.obs[i,]$n.obs = e
  i=i+1
  gc()
}
Jaap
  • 81,064
  • 34
  • 182
  • 193
Ruben
  • 3,452
  • 31
  • 47

4 Answers4

14

The answer from Marek in this question has proven very useful in the past. I wrote it down and use it almost daily since it was fast and efficient. We'll use ave() and seq_along().

foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011))

foo <- transform(foo, obs = ave(rep(NA, nrow(foo)), person, FUN = seq_along))
foo

  person year obs
1  pers1 1999   1
2  pers1 2000   2
3  pers1 2003   3
4  pers2 1998   1
5  pers2 2011   2

Another option using plyr

library(plyr)
ddply(foo, "person", transform, obs2 = seq_along(person))

  person year obs obs2
1  pers1 1999   1    1
2  pers1 2000   2    2
3  pers1 2003   3    3
4  pers2 1998   1    1
5  pers2 2011   2    2
Community
  • 1
  • 1
Chase
  • 67,710
  • 18
  • 144
  • 161
  • Are you taking about: http://stackoverflow.com/questions/6150968/adding-an-repeated-index-for-factors/6151333#6151333? – Marek May 29 '11 at 23:15
  • @Marek Thanks to you too then. – Ruben May 30 '11 at 00:33
  • @Marek - yes! I use that code snippet several times per week and have it in a file of "useful code snippets" at work. I should start copying down the source of those code snippets. Thanks again. I'll update the Q accordingly. I wonder if these two questions should be merged now since they are essentially identical? – Chase May 30 '11 at 02:07
5

A few alternatives with the and packages.

data.table:

library(data.table)
# setDT(foo) is needed to convert to a data.table

# option 1:
setDT(foo)[, rn := rowid(person)]   

# option 2:
setDT(foo)[, rn := 1:.N, by = person]

both give:

> foo
   person year rn
1:  pers1 1999  1
2:  pers1 2000  2
3:  pers1 2003  3
4:  pers2 1998  1
5:  pers2 2011  2

If you want a true rank, you should use the frank function:

setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]

dplyr:

library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())

both giving a similar result:

> foo
Source: local data frame [5 x 3]
Groups: person [2]

  person  year    rn
  (fctr) (dbl) (int)
1  pers1  1999     1
2  pers1  2000     2
3  pers1  2003     3
4  pers2  1998     1
5  pers2  2011     2
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • That's how I do it these days too, 4 years later. Ugh my code back then. I used a loop, manually incremented the loop counter and used `gc`. Wow. – Ruben Feb 17 '16 at 10:40
  • @Ruben I recognize the experience ;-) – Jaap Feb 17 '16 at 10:41
  • @Ruben Although I won't object, it is not necessary to accept my answer. I was just adding this answer as this questions is regularly used as a duplicate target. – Jaap Feb 17 '16 at 10:47
  • I think it's the better way. The ave snippet is something I have to look up if I haven't used it in a while, the dplyr way is faster and more idiomatic. – Ruben Feb 17 '16 at 10:53
2

Would by do the trick?

> foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011),obs=c(1,2,3,1,2))
> foo
  person year obs
1  pers1 1999   1
2  pers1 2000   2
3  pers1 2003   3
4  pers2 1998   1
5  pers2 2011   2
> by(foo, foo$person, nrow)
foo$person: pers1
[1] 3
------------------------------------------------------------ 
foo$person: pers2
[1] 2
lindelof
  • 34,556
  • 31
  • 99
  • 140
  • D'oh I didn't say it clearly 'nough: I need the `obs` column, I don't have it. I don't need the number of observations per person, I got that (did it with sqldf) – Ruben May 28 '11 at 16:13
0

Another option using aggregate and rank in base R:

foo$obs <- unlist(aggregate(.~person, foo, rank)[,2])

 # person year obs
# 1  pers1 1999   1
# 2  pers1 2000   2
# 3  pers1 2003   3
# 4  pers2 1998   1
# 5  pers2 2011   2
989
  • 12,579
  • 5
  • 31
  • 53