
I am creating some artificial data. I need to create a household ID (H_ID) and a person ID (P_ID, numbered within each household).

I found a way to create H_ID in a vectorized way.

N <- 50

### Household ID
# loop-for
set.seed(20110224)
H_ID <- vector("integer", N)
H_ID[1] <- 1
for (i in 2:N) if (runif(1) < .5) H_ID[i] <- H_ID[i-1]+1 else H_ID[i] <- H_ID[i-1]
print(H_ID)

# vectorised form
set.seed(20110224)
r <- c(0, runif(N-1))
H_ID <- cumsum(r < .5)
print(H_ID)
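
One way to confirm that the two forms agree is to wrap each in a function and compare the results under the same seed. This is only a sketch; the function names `h_id_loop` and `h_id_vec` are made up for illustration and are not part of the code above.

```r
# Hypothetical helper names; both consume N - 1 uniform draws in the same order.
h_id_loop <- function(N) {
  H_ID <- numeric(N)
  H_ID[1] <- 1
  for (i in seq_len(N)[-1]) {
    # start a new household with probability 0.5
    H_ID[i] <- H_ID[i - 1] + (runif(1) < .5)
  }
  H_ID
}

# Vectorised form: the leading 0 forces the first person into household 1.
h_id_vec <- function(N) cumsum(c(0, runif(N - 1)) < .5)

set.seed(20110224); a <- h_id_loop(50)
set.seed(20110224); b <- h_id_vec(50)
stopifnot(all(a == b))
```

Resetting the seed before each call matters here, since both versions must see the same random draws.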

But I cannot figure out how to create P_ID in a vectorized way.

### Person ID
# loop-for
P_ID <- vector("integer", N)
P_ID[1] <- 1
for (i in 2:N) if (H_ID[i] > H_ID[i-1]) P_ID[i] <- 1 else P_ID[i] <- P_ID[i-1]+1
print(cbind(H_ID, P_ID))

# vectorised form
# ???
nmichaels
djhurio

5 Answers


Another example:

P_ID <- ave(rep(1, N), H_ID, FUN=cumsum)

I found out about the ave function a few days ago (here), and find it a really useful and efficient shortcut in many situations.
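
For readers new to ave(): it applies FUN to the values within each group and returns the results in the original order. A minimal sketch on a toy vector (the name `h` and its values are made up here), showing that seq_along can stand in for cumsum over a vector of ones:

```r
h <- c(1, 1, 2, 3, 3, 3, 3, 4, 5, 5)   # toy household IDs, not the question's data

p1 <- ave(rep(1, length(h)), h, FUN = cumsum)  # the form used above
p2 <- ave(h, h, FUN = seq_along)               # same idea, without the vector of ones

print(cbind(h, p1, p2))
stopifnot(all(p1 == p2))
```

Because ave() groups by value rather than by position, this should also work when the rows of a household are not adjacent.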

crayola
P_ID <- unname(unlist(tapply(H_ID, H_ID, seq_along)))
kohske

Inspired by Martin Morgan's solution to a closely related question, here's a truly vectorized way to generate the P_ID using the cummax function. It becomes clear once you note that P_ID is closely related to the cumsum of !(r < 0.5):

set.seed(20110224)  # same seed as in the question, so the output below is reproducible
N <- 10
r <- c(0, runif(N-1))
H_ID <- cumsum(r < .5)
r_ <- r >= .5 # flip the coins that generated H_ID.
z <- cumsum(r_)  # this is almost P_ID; just need to subtract the right amount...
# ... and the right amount to subtract is obtained via cummax
P_ID <- 1 + z - cummax( z * (!r_) )
> cbind(H_ID, P_ID)
      H_ID P_ID
 [1,]    1    1
 [2,]    1    2
 [3,]    2    1
 [4,]    3    1
 [5,]    3    2
 [6,]    3    3
 [7,]    3    4
 [8,]    4    1
 [9,]    5    1
[10,]    5    2

I haven't done detailed timing tests, but it's probably very fast, since these are all internal, vectorized functions.
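
To check the cummax identity against the explicit loop from the question, something like the following could be used. It is a sketch under the same setup as above; `p_id_loop` and `p_id_cummax` are names invented for this example.

```r
# Hypothetical helpers; r is the vector of uniform draws, with r[1] = 0.
p_id_loop <- function(H_ID) {
  P_ID <- numeric(length(H_ID))
  P_ID[1] <- 1
  for (i in seq_along(H_ID)[-1]) {
    # restart at 1 when the household changes, otherwise keep counting
    P_ID[i] <- if (H_ID[i] > H_ID[i - 1]) 1 else P_ID[i - 1] + 1
  }
  P_ID
}

p_id_cummax <- function(r) {
  stay <- r >= .5                 # TRUE where the person joins the current household
  z <- cumsum(stay)               # running count of "stay" events
  1 + z - cummax(z * (!stay))     # subtract the count reached at the last household start
}

set.seed(20110224)
r <- c(0, runif(49))
H_ID <- cumsum(r < .5)
stopifnot(all(p_id_loop(H_ID) == p_id_cummax(r)))
```

The cummax step relies on z being non-decreasing: the largest value of z seen at a household start is always the most recent one.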

Prasad Chalasani
  • I did timing tests (`N <- 2e6`). Your solution is definitely the fastest: it is around 34 times faster than the `lapply` solution. Thanks! – djhurio Feb 28 '11 at 19:26

seq_along() is a useful tool here. This example splits H_ID by itself into a list containing the households:

> head(split(H_ID, H_ID))
$`1`
[1] 1 1

$`2`
[1] 2

$`3`
[1] 3 3 3 3
....

A solution to the question is then to lapply() seq_along() over each list element; for a non-empty vector foo, seq_along(foo) creates the vector 1:length(foo). Two final housekeeping steps unlist the result and remove the names:

> unname(unlist(lapply(split(H_ID, H_ID), seq_along)))
 [1] 1 2 1 1 2 3 4 1 1 2 3 1 1 1 1 1 2 3 4 5 1 2 3 4 1 1 2 1 2 1
[31] 1 2 1 2 3 4 1 2 1 2 1 2 1 1 2 1 2 1 2 3
Gavin Simpson

Here's a reasonably compact and expressive solution. It is somewhat similar to Simpson's in terms of its intermediate values:

# lapply() rather than sapply(), so equal-sized households are not simplified into a matrix
cbind(H_ID, unlist(lapply(table(H_ID), seq)))

The core of the strategy is to pass the table() counts to seq(), which, given a single positive number n, returns the sequence from 1 to n.
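
Base R also has sequence(), which concatenates seq_len(n) for each n in its argument, so the same idea might be written without an explicit apply call. A sketch on toy data (the vector `h` is invented here); like the solution above, it assumes the household IDs are sorted so that each household's rows are contiguous:

```r
h <- c(1, 1, 2, 3, 3, 3, 3, 4, 5, 5)   # toy household IDs, already sorted

counts <- table(h)                      # household sizes: 2 1 4 1 2
p_id <- sequence(as.vector(counts))     # 1:2, 1:1, 1:4, 1:1, 1:2 concatenated

print(cbind(h, p_id))
```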

IRTFM