How to calculate number and duration of categorical spells by ID in R

Question

I have a longitudinal dataset that records person's employment status monthly for 45 months. I would like to be able to create two variables to add to this dataset: 1) Overall duration each person spent "Unemployed" 2) Number of unemployment spells

Ideally it would also skip over NAs without interrupting the spell

I've created an example dataset to make things simple:


    ID <- c(1:10, 1:10, 1:10)
    date <- c("2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", 
              "2006-09-01", "2006-09-01", "2006-09-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", 
              "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-11-01", 
              "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", 
              "2006-11-01", "2006-11-01")
    act <- c("Unemployed", "Employment", "Education", "Education", "Education", "Education", "Education", 
             "Education", "Education", "Unemployed", "Education", "Unemployed", "Unemployed", "Unemployed", 
             "Education", "Education", "Employment", "Education", "Education", "NA", "Unemployed", 
             "Unemployed", "NA", "Unemployed", "Education", "Employment", "Employment", "NA", "Education", 
             "Unemployed")
    df <- data.frame(ID, date, act)
    df[order(ID),]

       ID       date        act
    1   1 2006-09-01 Unemployed
    11  1 2006-10-01  Education
    21  1 2006-11-01 Unemployed
    2   2 2006-09-01 Employment
    12  2 2006-10-01 Unemployed
    22  2 2006-11-01 Unemployed
    3   3 2006-09-01  Education
    13  3 2006-10-01 Unemployed
    23  3 2006-11-01         NA
    4   4 2006-09-01  Education
    14  4 2006-10-01 Unemployed
    24  4 2006-11-01 Unemployed
    5   5 2006-09-01  Education
    15  5 2006-10-01  Education
    25  5 2006-11-01  Education
    6   6 2006-09-01  Education
    16  6 2006-10-01  Education
    26  6 2006-11-01 Employment
    7   7 2006-09-01  Education
    17  7 2006-10-01 Employment
    27  7 2006-11-01 Employment
    8   8 2006-09-01  Education
    18  8 2006-10-01  Education
    28  8 2006-11-01         NA
    9   9 2006-09-01  Education
    19  9 2006-10-01  Education
    29  9 2006-11-01  Education
    10 10 2006-09-01 Unemployed
    20 10 2006-10-01         NA
    30 10 2006-11-01 Unemployed

I tried a solution proposed by Roland at Calculate duration in R but I am not sure how to adapt it to give me results by ID and deal with NAs.


    library(data.table)
    setDT(df)
    df[, date := as.POSIXct(date, format = "%Y-%m-%d", tz = "GMT")]

    glimpse(df)
    df$act <- ifelse(df$act == "Unemployed",1,-1)
    df[, run := cumsum(c(1, diff(act) != 0))]

    df1 <- df[, list(act = unique(act), 
                               duration = difftime(max(date), min(date), unit = "weeks")), 
                        by = run]
    df1
        run act duration
     1:   1   1  0 weeks
     2:   2  -1  0 weeks
     3:   3   1  0 weeks
     4:   4  -1  0 weeks
     5:   5   1  0 weeks
     6:   6  -1  0 weeks
     7:   7   1  0 weeks
     8:   8  -1  0 weeks
     9:   9   1  0 weeks
    10:  10  -1  0 weeks
    11:  11   1  0 weeks

What I am after is to achieve this (duration here is in months but can be weeks or days):

    ID spell_count duration
1    1           2        2
2    2           1        2
3    3           1        1
...
10  10           1        2

Any help with this would be greatly appreciated, any links/literature/examples.

Thank you.

You say you have monthly data. So each month, it was recorded whether or not the person was unemployed -- and in the real data set you would have 45 rows per ID? In this case the length of unemployment status (in month) could be just a matter of counting `Act %in% "Unemployed"` for each ID ... — lebatsnok, Jan 09 '19 at 17:28
@lebatsnok Yes, I have around 7,000 IDs and 45 employment values corresponding to 45 monthly data collections. So in a wide format it would be 7,000 rows with 45 employment variables and in a long format it would be 315,000 rows with 1 employment variable. — Maria, Jan 09 '19 at 17:34

Guillermo Santamaría · Accepted Answer · 2019-01-09T19:30:43.443

1

I am using only your first code block, then for Overall duration, I do:


    library(data.table)
    setDT(df)
    df_duration = df[act=="Unemployed",.(duration = .N),by = ID]

the number of unemployment spells is a little bit trickier:


    df_spell_count = df[order(ID,date)]
    df_spell_count <- df_spell_count[!(is.na(act)|act=="NA")]
    df_spell_count[,previous_act := shift(act,1),by = ID]
    df_spell_count<-df_spell_count[act =="Unemployed" & (previous_act!="Unemployed" | is.na(previous_act))]
    df_spell_count<-df_spell_count[,.(spell_count =.N),by = ID]

If you want to merge both things, just:

df_stats <- merge(df_duration,df_spell_count, by = "ID", all.x = TRUE,all.y = TRUE)

Observe that this df does not contain rows for those users without unemployment periods.

edited Jan 09 '19 at 19:30

answered Jan 09 '19 at 18:13

Guillermo Santamaría

170
1
6

could I ask about a little extension of this code? Let's say I have another variable which tells me the interview date. Based on when a person was interviewed, I'd like to include a different time frame in which to calculate the duration and count. How would I adjust the code to do this? For instance, if interview_date is 2007-10-01 I want to include data from 2006-09-01 and 2006-10-01 but if interview_date is 2007-11-01 I want to include data from 2006-09-01, 2006-10-01, and 2006-11-01. – Maria Jan 09 '19 at 20:27
So if I understood well, you have an interview date and you want to base your stats in the activities previous to that date. So, I'd just filter out the data frame df according to the interview date of each User ID. If those dates are in a different df, interview_df, merge this with the original df, by User ID, and then just filter date <= interview_date. BTW, would you mind voting my response? Than you! – Guillermo Santamaría Jan 09 '19 at 23:34
Thank you so much for your help. I'm a new user and just got my voting privileges, hence the delay but I've voted now. It's a great response. – Maria Jan 10 '19 at 09:53

Rui Barradas · Answer 2 · 2019-01-09T21:08:38.467

With package tidyverse you can group by a variable (or more) and summarise very easily.

Before aggregating the data, I will coerce column date to class Date and replace the character strings "NA" by actual missing values, NA.

library(tidyverse)

is.na(df$act) <- df$act == "NA"
df$date <- as.Date(df$date)

df %>%
  group_by(ID, act) %>%
  summarise(spell_count = sum(act == "Unemployed", na.rm = TRUE),
            duration = difftime(last(date), first(date), units = "weeks")) %>%
  filter(act == "Unemployed") %>%
  select(-act)
## A tibble: 5 x 3
## Groups:   ID [5]
#     ID spell_count duration      
#  <int>       <int> <time>        
#1     1           2 8.714286 weeks
#2     2           2 4.428571 weeks
#3     3           1 0.000000 weeks
#4     4           2 4.428571 weeks
#5    10           2 8.714286 weeks

The code above will give just the rows where there is at least one act == "Unemployed".
If you want all rows the following base R solution will do it.

res <- lapply(split(df, df$ID), function(DF){
  i <- DF$act == "Unemployed"
  if(any(i, na.rm = TRUE))
    duration <- difftime(max(DF$date[i], na.rm = TRUE), min(DF$date[i], na.rm = TRUE), units = "weeks")
  else
    duration <- 0
  spell_count <- sum(i, na.rm = TRUE)
  data.frame(ID = DF$ID[1], spell_count, duration)

})

res <- do.call(rbind, res)
row.names(res) <- NULL
res
#   ID spell_count       duration
#1   1           2 8.714286 weeks
#2   2           2 4.428571 weeks
#3   3           1 0.000000 weeks
#4   4           2 4.428571 weeks
#5   5           0 0.000000 weeks
#6   6           0 0.000000 weeks
#7   7           0 0.000000 weeks
#8   8           0 0.000000 weeks
#9   9           0 0.000000 weeks
#10 10           2 8.714286 weeks

This is definitely getting there, thank you. There are two issues though. The duration column is showing overall duration rather than duration of the spell (e.g. ID 5 should have duration 0). Also, ID 10 should have spell_count 1 rather than 2. Is there a way to do this? — Maria, Jan 09 '19 at 17:39
As for the second, * ID 10 should have spell_count 1 rather than 2* rows number 10 and 30 both have `ID == 10` and `act == "Unemployed"`. So it should be `2`. Or am I wrong? — Rui Barradas, Jan 09 '19 at 17:43
You're absolutely right, I was just wondering whether there is a way to ignore row number 20 which is NA and thus make it a single spell rather than having the NA value split the spell into two. I suppose I could remove all rows that have NAs in them and this should take care of it. — Maria, Jan 09 '19 at 17:49
Yeah, but if you for instance if you look at ID 2, it should also have spell_count only 1 as it is 1 continuous spell (of 2 months). So basically, what you provided as spell_count column is actually a duration column in months, which is really helpful (thank you) but I still don't know how to figure out number of spells. — Maria, Jan 09 '19 at 18:05

score 0 · Answer 3 · answered Jul 16 '19 at 17:52

Here's another attempt using the tidyverse. Data on "spells" is a common transformation of panel data; in the tidyverse approach the trick I think is to generate a spell variable, like the "run" variable in the OP's original code.

# libraries
library(tidyverse)
library(zoo)
library(lubridate)

# example dataset
ID <- c(1:10, 1:10, 1:10)
date <- c("2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", "2006-09-01", 
          "2006-09-01", "2006-09-01", "2006-09-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", 
          "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-10-01", "2006-11-01", 
          "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", "2006-11-01", 
          "2006-11-01", "2006-11-01")
act <- c("Unemployed", "Employment", "Education", "Education", "Education", "Education", "Education", 
         "Education", "Education", "Unemployed", "Education", "Unemployed", "Unemployed", "Unemployed", 
         "Education", "Education", "Employment", "Education", "Education", "NA", "Unemployed", 
         "Unemployed", "NA", "Unemployed", "Education", "Employment", "Employment", "NA", "Education", 
         "Unemployed")
df <- data.frame(ID, date, act)
df[order(ID),]

# convert types of some variables (in particular use zoo::yearmon instead of date, since these are actually yearmonth combos)
df$act <- as.character(df$act)
df$date <- lubridate::ymd(df$date)
df$yearmon <- zoo::as.yearmon(df$date)
df$act <- ifelse(df$act=='NA',NA,df$act)


# construct "act2", which is act, except when an NA is surrounded by the SAME act before and after, it is replaced with that same act
# e.g. Unemployed NA Unemployed -> Unemployed Unemployed Unemployed
# e.g. Education NA Unemployed -> stays the same
# (see note at the end of this discussion for more details on this)
df <- df %>% arrange(ID,date)

df <- df %>% group_by(ID) %>% mutate(
  act2 = ifelse(is.na(act) & (lag(act)==lead(act)), lead(act), act)
)

# create "spell" variable, which is like the "run" variable in the example code
# within ID this identifies the spell that is currently taken place 
# --- this is the most important part of the code ---
df <- df %>% group_by(ID) %>% mutate(
  spell = cumsum(coalesce(is.na(act2) | act2!=lag(act2),FALSE)) + 1
)

# add yearmonth + 1 month, in order to do duration calculations
# (I'm again exploiting the fact that your data is monthly. if this were not true, this variable could be lead(date), within ID. but then we'd have to figure out how to deal with ends of the panel, where lead(date) is NA)
df$yearmonplusmonth <- df$yearmon + (1/12)

# construct a dataset of ID-spell combinations
spells <- df %>% group_by(ID,spell) %>% summarize(
  spelltype = first(act2),
  duration = (max(yearmonplusmonth) - min(yearmon))*12
)

# construct a dataset at the ID level, with desired summaries of spells
spellsummary <- spells %>% group_by(ID,spelltype) %>% summarize(
  spell_count = n(),
  duration = sum(duration)
) 

# if there are no spells of a given spelltype, it doesn't appear in spellsummary
# we need to fill out spellsummary with zeroes in ID-spelltype cases where there are no spells:
temp <- expand.grid(ID = unique(spellsummary$ID), spelltype = unique(spellsummary$spelltype))
spellsummary <- full_join(spellsummary,temp,by=c('ID','spelltype'))
spellsummary <- spellsummary %>% mutate_at(vars(spell_count,duration),funs(coalesce(as.numeric(.),0)))
spellsummary <- spellsummary %>% mutate_at(vars(spell_count,duration),funs(round(.,0)))
spellsummary <- spellsummary %>% arrange(ID,spelltype)

# finally, we just want Unemployed spelltype summaries by ID:
spellsummary %>% filter(spelltype=='Unemployed')

# A tibble: 10 x 4
# Groups:   ID [10]
# ID spelltype  spell_count duration
# <int> <chr>            <dbl>    <dbl>
# 1     1 Unemployed           2        2
# 2     2 Unemployed           1        2
# 3     3 Unemployed           1        1
# 4     4 Unemployed           1        2
# 5     5 Unemployed           0        0
# 6     6 Unemployed           0        0
# 7     7 Unemployed           0        0
# 8     8 Unemployed           0        0
# 9     9 Unemployed           0        0
# 10    10 Unemployed           1        3

Note: I get 3 for the duration in the last row, rather than 2 in the desired output of the OP. The reason is that I assume that Unemp NA Unemp is really Unemp Unemp Unemp, both for purposes of spell_count AND for purposes of duration. The OP wants this to be the case for spell_count but not for duration. To achieve this, one approach might be to use the "act" variable for duration calculations and the "act2" variable for spell_count calculations -- I leave this to the reader.

How to calculate number and duration of categorical spells by ID in R

3 Answers3

Linked