1

Much of the following problem arises from the sheer size of the dataframe (198240 observations). I'll try to break it down as best as I can.

The Goal

I want to create a variable DURATION which is how long a house was sick.

The Known

  • Household ID and Week (There are 1120 houses and 177 weeks)
  • HDINC (Currently Sick variable )
  • HDINC_1 (Sick Week Prior variable )

The Problem I don't understand how to get the function/loop to be traversing the dataframe in both household and time concurrently.

I know it will be a function or loop that goes something like the following (Not in R-code, but in logic)

   IF (hdinc > 0)       #a house on a certain date is sick 
       { Duration = 1 AND  look at hdinc_1 
           IF (hdinc_1 = 0 )
                { Duration = Duration + 0  
                  AND Go onto the next date for that house. 
           IF hdinc_1 >0 then       #if the house was sick last week
                 { Duration = Duration + 1   
                   Go to SameHouse, Week-1 and look at hdinc_1 to see if it was sick the week prior 

I am having trouble with the following:

  • Getting it to start on a particular observation based on household/date
  • Moving the function backwards or forwards while maintaining the household
  • Eventually getting the function to restart using a different household

I know this is really convoluted but I can't even get the loop to start to provide y'all sample code.

Sample Data:

dat <- structure(list(id_casa = c(802L, 802L, 802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L, 955L), survdate = structure(c(3L, 10L, 5L, 1L, 2L, 4L, 11L, 6L, 7L, 8L, 9L), .Label = c("1/11/2006", "1/18/2006", "1/19/2005", "1/25/2006", "1/4/2006", "10/13/2004", "10/20/2004", "10/27/2004", "11/3/2004", "12/28/2005", "2/1/2006" ), class = "factor"), hdinc = c(125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159, 2.5), hdinc_1 = c(0, 125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159)), .Names = c("id_casa", "survdate", "hdinc", "hdinc_1"), class = "data.frame", row.names = c(NA, -11L)) 

Sample Output: Sample Output

Jaap
  • 81,064
  • 34
  • 182
  • 193
Sheegor
  • 19
  • 1
  • 4
  • 3
    Maybe provide some sample data, if not sample code? – jeremycg Oct 23 '15 at 18:30
  • Also, some sample output would be nice. I'm not 100% sure exactly what you need. For example, if a house was sick every other week, would you want the output to be "1" (maximum sick duration) or "88" (total weeks sick) or "1,1,1,1,1,1,1,1,..." (list of sickness durations) or something else? Start date? – Derwin McGeary Oct 23 '15 at 18:39
  • Can you at least describe the structure of your data? For example, you mention having 198240 observations. Do you mean observations or do you mean elements in your matrix? Because 198240 happens to also equal 1120 houses * 177 weeks. Do you actually have 198240 rows? Depending on whether you can re-structure your data, you actually might not need a loop to solve this problem. – JFu Oct 23 '15 at 18:39
  • Ok I tried uploading a picture of how it would look if exported to Excel. Does that help? – Sheegor Oct 23 '15 at 18:40
  • instead of excel, try using `dput`. See the [instructions on this question](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – jeremycg Oct 23 '15 at 18:51
  • 1
    structure(list(id_casa = c(802L, 802L, 802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L, 955L), survdate = structure(c(3L, 10L, 5L, 1L, 2L, 4L, 11L, 6L, 7L, 8L, 9L), .Label = c("1/11/2006", "1/18/2006", "1/19/2005", "1/25/2006", "1/4/2006", "10/13/2004", "10/20/2004", "10/27/2004", "11/3/2004", "12/28/2005", "2/1/2006" ), class = "factor"), hdinc = c(125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159, 2.5), hdinc_1 = c(0, 125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159)), .Names = c("id_casa", "survdate", "hdinc", "hdinc_1"), class = "data.frame", row.names = c(NA, -11L)) – Sheegor Oct 23 '15 at 19:13
  • why not just edit your question with that instead of posting it in the comments – rawr Oct 23 '15 at 21:16

2 Answers2

0

We can use the function rle in combination with dplyr to find runs, and then remove those where the run is of wellness:

library(dplyr)
dat %>% group_by(id_casa) %>%
        mutate(duration = unlist(lapply(rle(hdinc > 0)[["lengths"]], seq, from = 1))) %>%
        mutate(duration = ifelse(hdinc > 0, as.numeric(duration), 0))

Source: local data frame [11 x 5]
Groups: id_casa [2]

   id_casa   survdate    hdinc  hdinc_1 duration
     (int)     (fctr)    (dbl)    (dbl)    (dbl)
1      802  1/19/2005 125.0000   0.0000        1
2      802 12/28/2005 142.8571 125.0000        2
3      802   1/4/2006   0.0000 142.8571        0
4      802  1/11/2006   0.0000   0.0000        0
5      802  1/18/2006   0.0000   0.0000        0
6      802  1/25/2006 142.8571   0.0000        1
7      802   2/1/2006   0.0000 142.8571        0
8      955 10/13/2004  50.0000   0.0000        1
9      955 10/20/2004  32.0000  50.0000        2
10     955 10/27/2004 159.0000  32.0000        3
11     955  11/3/2004   2.5000 159.0000        4

How it works: first we find all the runs using rle:

rle(dat$hdinc>0)
Run Length Encoding
  lengths: int [1:5] 2 3 1 1 4
  values : logi [1:5] TRUE FALSE TRUE FALSE TRUE

We then make a seq from 0 to each of the lengths from the rle using lapply:

z <- unlist(lapply(rle(dat$hdinc > 0)[["lengths"]], seq, from = 1))
z
 [1] 1 2 1 2 3 1 1 1 2 3 4

Then we filter that by whether it was sickness or wellness:

ifelse(dat$hdinc > 0, z, 0)
 [1] 1 2 0 0 0 1 0 1 2 3 4

Using dplyr group_by we make sure we are running it on each id_casa by itself.

EDIT: In base:

dat$duration2 <- ifelse(dat$hdinc > 0,
                   unlist(by(dat, dat$id_casa, FUN = function(x) unlist(lapply(rle(x$hdinc > 0)[["lengths"]], seq, from = 1)))),
                   0)
jeremycg
  • 24,657
  • 5
  • 63
  • 74
0

Using only base R :

# create sample data
sampleData <-
  structure(list(id_casa = c(802L, 802L, 802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L, 955L), 
                 survdate = structure(c(3L, 10L, 5L, 1L, 2L, 4L, 11L, 6L, 7L, 8L, 9L), 
                                      .Label = c("1/11/2006", "1/18/2006", "1/19/2005", "1/25/2006", "1/4/2006", "10/13/2004", "10/20/2004", "10/27/2004", "11/3/2004", "12/28/2005", "2/1/2006" ), class = "factor"), 
                 hdinc = c(125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159, 2.5), hdinc_1 = c(0, 125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159)), 
            .Names = c("id_casa", "survdate", "hdinc", "hdinc_1"), class = "data.frame", row.names = c(NA, -11L))

# you must be sure the rows are already ordered, otherwise you can use something like:
#sampleData <- sampleData[order(sampleData$id_casa,sampleData$survdate),]

sampleData$Duration <- 
unlist(
   by(sampleData,
   INDICES=sampleData$id_casa,
   FUN=function(house){
     tail(Reduce(f=function(prv,nxt){if(nxt == 0) 0 else (prv+nxt)},
                 x=as.integer(house$hdinc > 0),init=0,accumulate=TRUE),-1)
     }))

> sampleData
   id_casa   survdate    hdinc  hdinc_1 Duration
1      802  1/19/2005 125.0000   0.0000        1
2      802 12/28/2005 142.8571 125.0000        2
3      802   1/4/2006   0.0000 142.8571        0
4      802  1/11/2006   0.0000   0.0000        0
5      802  1/18/2006   0.0000   0.0000        0
6      802  1/25/2006 142.8571   0.0000        1
7      802   2/1/2006   0.0000 142.8571        0
8      955 10/13/2004  50.0000   0.0000        1
9      955 10/20/2004  32.0000  50.0000        2
10     955 10/27/2004 159.0000  32.0000        3
11     955  11/3/2004   2.5000 159.0000        4
digEmAll
  • 56,430
  • 9
  • 115
  • 140