I have monthly data on people's employment status over one year. The data frame has 4 variables: person ID, country, month, and main activity in that month (Employed, Unemployed, Inactive, Other). Here is an example:

# Three observations per person, matching the 30 rows printed below
ID <- rep(1:10, times = 3)
country <- rep(c("AT", "BE", "CH", "CZ", "HR", "SO", "SV", "RU", "GR", "GE"), times = 3)
month <- c("Jan", "Feb", "Mar", "Apr", "May", "Aug", "Dec", "Nov", "Sep", "Jan",
           "Jun", "Jul", "Oct", "Jan", "Feb", "Mar", "Apr", "May", "Aug", "Dec",
           "Nov", "Sep", "Jan", "Jun", "Jul", "Oct", "Nov", "Jan", "Jun", "Aug")
act <- c("Unemployed", "Employed", "Other", "Other", "Inactive", "Unemployed", "Employed",
         "Employed", "Employed", "Unemployed", "Other", "Unemployed", "Unemployed", "Unemployed",
         "Other", "Other", "Employed", "Other", "Other", "NA",
         "Unemployed", "Unemployed", "NA", "Unemployed", "Other", "Employed", "Employed",
         "NA", "Inactive", "Unemployed")
df <- data.frame(ID, country, month, act)
df[order(df$ID), ]

   ID country month        act
1   1      AT   Jan Unemployed
11  1      AT   Jun      Other
21  1      AT   Nov Unemployed
2   2      BE   Feb   Employed
12  2      BE   Jul Unemployed
22  2      BE   Sep Unemployed
3   3      CH   Mar      Other
13  3      CH   Oct Unemployed
23  3      CH   Jan         NA
4   4      CZ   Apr      Other
14  4      CZ   Jan Unemployed
24  4      CZ   Jun Unemployed
5   5      HR   May   Inactive
15  5      HR   Feb      Other
25  5      HR   Jul      Other
6   6      SO   Aug Unemployed
16  6      SO   Mar      Other
26  6      SO   Oct   Employed
7   7      SV   Dec   Employed
17  7      SV   Apr   Employed
27  7      SV   Nov   Employed
8   8      RU   Nov   Employed
18  8      RU   May      Other
28  8      RU   Jan         NA
9   9      GR   Sep   Employed
19  9      GR   Aug      Other
29  9      GR   Jun   Inactive
10 10      GE   Jan Unemployed
20 10      GE   Dec         NA
30 10      GE   Aug Unemployed

My goal is to create a new data frame in which every row represents one spell of employment, with the condition that the spell is both preceded AND followed by a spell of unemployment. That way I include only spells where people switched from unemployment to employment and back to unemployment, and I can calculate the duration of these spells. Ideally the result would have 5 variables: person ID, country, spell duration, starting month, and ending month. It should look like this:

   ID country spell_duration starting ending
1   1      AT              5      Jan    May
11  1      AT              5      Jun    Oct
2   2      BE              7      Feb    Aug
12  2      BE              6      Jul    Dec
3   3      CH             10      Mar    Dec
13  3      CH              1      Oct    Oct
4   4      CZ              8      Apr    Nov
14  4      CZ              5      Jan    May
5   5      HR              5      May    Sep
15  5      HR              4      Feb    May
6   6      SO              2      Aug    Sep
16  6      SO              6      Mar    Aug
7   7      SV              1      Dec    Dec
17  7      SV              9      Apr    Dec
8   8      RU              8      Nov    Dec
18  8      RU              7      May    Nov
9   9      GR              3      Sep    Nov
19  9      GR              2      Aug    Sep
10 10      GE              8      Jan    Aug
20 10      GE              1      Dec    Dec

I already found this solution by Maria (How to calculate number and duration of categorical spells by ID in R), but her problem is different. I don't want the overall duration of employment, and I don't really need the number of spells either.

  • How do you calculate the columns `starting` and `ending`? – ThomasIsCoding Aug 12 '21 at 08:37
  • That's part of my question. Sorry if it wasn't clear enough in the post! –  Aug 12 '21 at 08:50
  • Better show the logic about how those two columns are achieved. – ThomasIsCoding Aug 12 '21 at 08:52
  • I don't know how to achieve these two columns, that's part of my problem. What I want to achieve is the starting month and the ending month of the employment spell as well as the duration of it. –  Aug 12 '21 at 08:57

2 Answers

This is the first thing that came to mind, without much thought. It is very cumbersome, and I'm sure there are more elegant solutions, but it doesn't require any additional packages.

data <- df
Empl_spells <- data.frame(ID = c(), Start = c(), End = c())

for(user in unique(data$ID)){
  # subset per user
  user_dat <- data[data$ID == user,]
  # initiate a list to store where changes occur and a counter for
  # entries to this list
  if(nrow(user_dat) > 2){
    Changes_data <- list()
    entry <- 1
    # for every row, check if it switches from employed to unemployed
    # or the opposite. Mark with "Break" if some other entry interrupts
    # (&& is used because each comparison is on a single value)
    for(i in 2:nrow(user_dat)){
      if(user_dat$act[i] == "Employed" &&
         user_dat$act[i-1] == "Unemployed"){
        Changes_data[[entry]] <- c("Start", i)
        entry <- entry + 1
      }else if(user_dat$act[i] == "Unemployed" &&
               user_dat$act[i-1] == "Employed"){
        Changes_data[[entry]] <- c("End", i)
        entry <- entry + 1
      }else if(user_dat$act[i] != "Employed" &&
               user_dat$act[i] != "Unemployed"){
        Changes_data[[entry]] <- c("Break", i)
        entry <- entry + 1
      }
    }
    # find where an "End" immediately follows a "Start" in the new list
    Changes_df <- do.call(rbind.data.frame, Changes_data)
    EmplToUnempl <- which(Changes_df[-nrow(Changes_df), 1] == "Start" & Changes_df[-1, 1] == "End")
    if(length(EmplToUnempl) >= 1){
      append <- data.frame(ID = user,
                           Start = user_dat$month[as.numeric(Changes_df[EmplToUnempl, 2])],
                           End = user_dat$month[as.numeric(Changes_df[EmplToUnempl + 1, 2])-1])
      # append the data to the data.frame for all of the people
      Empl_spells <- rbind(Empl_spells, append)
    }
  }
}

Since I don't have your data, I didn't test this. Is this what you want?
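The same idea (keep only runs of "Employed" that sit directly between two runs of "Unemployed") can also be sketched with base R's `rle()`. This is a rough sketch, not the code above: the helper name `find_spells` and the toy input are made up for illustration, and it assumes one person's rows are already sorted by calendar month and that `act` contains no real `NA` values.

```r
# Hypothetical helper: returns the Employed runs that are directly
# preceded and followed by an Unemployed run.
# Assumes `act` and `month` are in calendar order for a single person.
find_spells <- function(act, month) {
  runs <- rle(as.character(act))
  n <- length(runs$values)
  if (n < 3) return(NULL)
  ends <- cumsum(runs$lengths)        # last row of each run
  starts <- ends - runs$lengths + 1   # first row of each run
  mid <- 2:(n - 1)                    # runs with a neighbour on both sides
  keep <- mid[runs$values[mid] == "Employed" &
              runs$values[mid - 1] == "Unemployed" &
              runs$values[mid + 1] == "Unemployed"]
  if (length(keep) == 0) return(NULL)
  data.frame(Start = month[starts[keep]],
             End = month[ends[keep]],
             Rows = runs$lengths[keep])
}

# Toy input, invented for illustration
act <- c("Unemployed", "Employed", "Employed", "Unemployed", "Other", "Employed")
month <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
spells <- find_spells(act, month)
print(spells)
```

Here only the Feb to Mar run qualifies; the final "Employed" row is dropped because it follows "Other" and has nothing after it. `Rows` equals the duration in months only if every month of the spell is present in the data.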

Edit (replace the outer `for` loop with `lapply`; probably makes it faster):

data <- df

users <- unique(data$ID)
calculate <- function(user){
  # subset per user
  user_dat <- data[data$ID == user,]
  # initiate a list to store where changes occur and a counter for
  # entries to this list
  if(nrow(user_dat) > 2){
    Changes_data <- list()
    entry <- 1
    # for every row, check if it switches from employed to unemployed
    # or the opposite. Mark with "Break" if some other entry interrupts
    # (&& is used because each comparison is on a single value)
    for(i in 2:nrow(user_dat)){
      if(user_dat$act[i] == "Employed" &&
         user_dat$act[i-1] == "Unemployed"){
        Changes_data[[entry]] <- c("Start", i)
        entry <- entry + 1
      }else if(user_dat$act[i] == "Unemployed" &&
               user_dat$act[i-1] == "Employed"){
        Changes_data[[entry]] <- c("End", i)
        entry <- entry + 1
      }else if(user_dat$act[i] != "Employed" &&
               user_dat$act[i] != "Unemployed"){
        Changes_data[[entry]] <- c("Break", i)
        entry <- entry + 1
      }
    }
    # find where an "End" immediately follows a "Start" in the new list
    Changes_df <- do.call(rbind.data.frame, Changes_data)
    EmplToUnempl <- which(Changes_df[-nrow(Changes_df), 1] == "Start" & Changes_df[-1, 1] == "End")
    if(length(EmplToUnempl) >= 1){
      append <- data.frame(ID = user,
                           Start = user_dat$month[as.numeric(Changes_df[EmplToUnempl, 2])],
                           End = user_dat$month[as.numeric(Changes_df[EmplToUnempl + 1, 2])-1])
      # return this person's spells; all results are combined after lapply
      return(append)
    }
  }
}

empl_spells <- lapply(users, FUN = calculate)
Empl_spells <- do.call(rbind.data.frame, empl_spells)

Edit #2 (calculate duration):

MonthToNumeric <- function(x){
  which(c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec") == x)
}

calcDuration <- function(Start, End){
  return(MonthToNumeric(End) - MonthToNumeric(Start) + 1)
}

Empl_spells$Duration <- mapply(FUN = calcDuration, Start = Empl_spells[, 2], End = Empl_spells[, 3])
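As a possible alternative, the duration can be computed for the whole column at once with `match()` and R's built-in `month.abb` vector, which avoids one function call per row. A minimal sketch on a made-up `spells` data frame (the column names mirror `Empl_spells`, the values are invented); like the version above, it assumes a spell does not wrap around the end of the year:

```r
# Made-up example with the same columns as Empl_spells
spells <- data.frame(ID = c(1, 2),
                     Start = c("Feb", "Jun"),
                     End = c("Mar", "Nov"),
                     stringsAsFactors = FALSE)

# match() gives the position of each month in month.abb (Jan = 1, ..., Dec = 12)
spells$Duration <- match(spells$End, month.abb) - match(spells$Start, month.abb) + 1
print(spells)
```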
Manuel Popp
  • Sorry for the late response. I tried it with my data and I get an error called "Error in xi[[j]] : object of type 'closure' is not subsettable". (Also I think in `}else if((user_dat$act[i] == "Unemployed" &` there is one bracket too much) –  Aug 16 '21 at 10:15
  • Now that I have some data, I changed a few things. But it has become even more cumbersome, I guess... – Manuel Popp Aug 16 '21 at 20:18
  • I think it's working, because there are no errors. But it takes a really long time, it only got 560 employment spells after 20 minutes. Is there any way to speed up the process or is this due to the big dataset I'm working with (>6.000.000 observations)? –  Aug 17 '21 at 08:31
  • Oh and maybe you have also an idea on how to calculate the duration of each of these employment spells in a simple way and add it to the final dataframe? –  Aug 17 '21 at 08:52
  • `for` is slow in R and usually the computing time is lower when `apply` is used instead. For my solution, `lapply` is applicable for the outer `for` loop where the code is applied for every person in the list. I made some minor changes to do so in **Edit**. I also added a 2nd **Edit #2** with functions that replace the month abbreviations with numeric values and add a column "Duration" to the results. – Manuel Popp Aug 17 '21 at 20:29
  • I'd recommend you to calculate a few examples by hand and make sure the script returns the correct results. – Manuel Popp Aug 17 '21 at 20:30
  • Thanks for the effort, but it doesn't work. It takes forever as soon as it comes to the `lapply`-function and doesn't seem to finish (I waited 30 minutes). –  Aug 18 '21 at 08:42
  • It should work in theory, but obviously the code takes a huge amount of time with your data set. I am not sure whether there is a way to get results fast on your machine. If you have the possibility to run the program for a longer time, you could first test everything with a small subset (e.g. insert `df <- df[1:1000, ]`) and, if it works, repeat for the complete set. You could also insert `t_start <- Sys.time()` before the `lapply` call and `t_end <- Sys.time()` after it, so you can see how long it took for 1000 entries and estimate how long it will take for the complete data. – Manuel Popp Aug 18 '21 at 16:41
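The timing check suggested in the last comment could look roughly like this. It is only a sketch: `calculate_dummy` is a hypothetical stand-in for the `calculate` function from the answer, and the data frame is generated here just so the snippet runs on its own.

```r
# Invented stand-in data; replace with the real data frame
df <- data.frame(ID = rep(1:500, each = 12),
                 month = rep(month.abb, times = 500),
                 act = rep(c("Unemployed", "Employed", "Unemployed"), length.out = 6000))

# Work on a small subset first
df_small <- df[1:1000, ]
users <- unique(df_small$ID)

# Hypothetical stand-in for the per-person function
calculate_dummy <- function(user) sum(df_small$ID == user)

t_start <- Sys.time()
res <- lapply(users, FUN = calculate_dummy)
t_end <- Sys.time()

elapsed <- as.numeric(difftime(t_end, t_start, units = "secs"))
# Naive linear extrapolation; the real cost may grow faster than linearly,
# because every call scans the whole data frame to find one person's rows
estimate_full <- elapsed * nrow(df) / nrow(df_small)
cat("subset:", elapsed, "s; naive estimate for all rows:", estimate_full, "s\n")
```

The extrapolation is best treated as a lower bound: with more rows there are also more distinct IDs, so both the number of calls and the cost of each call increase.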

I used the data.table package for the data manipulation, and the loops seem to work.

EDIT: There was one extra "}" left over, which I have removed. I've tried it and it works.

EDIT2: I added "setDT(df)" too.

library(data.table)

 df <- fread(paste("ID country month        act
    1      AT   Jan Unemployed
    1      AT   Jun      Other
    1      AT   Nov Unemployed
    2      BE   Feb   Employed
    2      BE   Jul Unemployed
    2      BE   Sep Unemployed
    3      CH   Mar      Other
    3      CH   Oct Unemployed
    3      CH   Jan         NA
    4      CZ   Apr      Other
    4      CZ   Jan Unemployed
    4      CZ   Jun Unemployed
    5      HR   May   Inactive
    5      HR   Feb      Other
    5      HR   Jul      Other
    6      SO   Aug Unemployed
    6      SO   Mar      Other
    6      SO   Oct   Employed
    7      SV   Dec   Employed
    7      SV   Apr   Employed
    7      SV   Nov   Employed
    8      RU   Nov   Employed
    8      RU   May      Other
    8      RU   Jan         NA
    9      GR   Sep   Employed
    9      GR   Aug      Other
    9      GR   Jun   Inactive
    10      GE   Jan Unemployed
    10      GE   Dec         NA
    10      GE   Aug Unemployed", collapse = '\n'))
setDT(df)
df[, monthInt := match(month, month.abb)]

df <- df[order(ID,monthInt)]

finalDt <- data.table()
for (i in unique(df[, ID])) {
  
  tempT <- df[ID == i]
  
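  # seq_len() yields an empty sequence when a person has only one row,
  # whereas 1:(nrow(tempT)-1) would wrongly iterate over c(1, 0)
  for (tim in seq_len(nrow(tempT) - 1)) {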
    timT <- data.table(ID = tempT[tim,ID],
                       country = tempT[tim, country],
                       spell_duration = tempT[tim+1, monthInt] - tempT[tim, monthInt],
                       starting = month.abb[tempT[tim, monthInt]],
                       ending = month.abb[tempT[tim+1, monthInt]-1])
    
    finalDt <- rbind(finalDt,timT)
    
  }
}
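Growing `finalDt` with `rbind()` inside two nested loops can get slow on large data. A loop-free sketch that should produce the same columns, using data.table's `by` grouping; the small table `dt` is a made-up subset of the example, and the code assumes each person's observations fall in distinct months:

```r
library(data.table)

# Made-up subset of the example data
dt <- data.table(ID = c(1, 1, 1, 2, 2),
                 country = c("AT", "AT", "AT", "BE", "BE"),
                 month = c("Jan", "Nov", "Jun", "Feb", "Jul"))
dt[, monthInt := match(month, month.abb)]
setorder(dt, ID, monthInt)

# Within each person, pair every observation with the next one:
# [-.N] drops the last row of the group, [-1] drops the first
res <- dt[, .(country = country[-.N],
              spell_duration = diff(monthInt),
              starting = month[-.N],
              ending = month.abb[monthInt[-1] - 1]),
          by = ID]
print(res)
```

For ID 1 this gives Jan to May and Jun to Oct, each with a duration of 5, which matches the first two rows of the desired output in the question.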
gokhan can
  • I tried it with my data and the data.table package but there occurs an error with ":=". It says "Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways." I don't really get why, because when I try it with the example dataset you posted, it works. –  Aug 12 '21 at 11:08
  • One extra "}" left and I edited it. I've tried it and it works. – gokhan can Aug 12 '21 at 11:37
  • I still get an error in the end: "Error in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult, : Incompatible join types: x.ID is type integer64 but i.ID is type double and contains fractions" . The ID-Variable doesn't contain any commas, it's integer with up to 7 digits. –  Aug 12 '21 at 12:56
  • There isn't any merge in my code, so I don't understand. Maybe the problem is your data. Check the class of that variable and make sure the ID column is an integer. In the data.table package you can do it like this: df[, ID := as.integer(ID)] – gokhan can Aug 12 '21 at 13:12