3

I have a dataset which include the following two columns. They refer to sick leave.

df <- data.frame(BEGIN_DT = c("04/12/2013", "14/01/2013", "31/12/2012", "24/09/2013", "31/12/2013"),
           END_DT = c("04/12/2013", "15/01/2013", "04/01/2013", "27/09/2013",  "31/12/2013"))
rownames(df) <- c(16:20)

     BEGIN_DT     END_DT
16 04/12/2013 04/12/2013
17 14/01/2013 15/01/2013
18 31/12/2012 04/01/2013
19 24/09/2013 27/09/2013
20 31/12/2013 31/12/2013

What I am looking to do is to make five additional columns, (Q1, Q2, Q3, Q4, Total) counting sick days taken for each quarter, based on the BEGIN_DT and END_DT values.

For example for row 18:

  BEGIN_DT       END_DT  Q1  Q2  Q3  Q4  Total
31/12/2012   04/01/2013   4   0   0   1      5

I've looked here, and this seems to count the total, but I am really stuck with how to break these into quarters.

calculating number of days between 2 columns of dates in data frame 2-columns-of-dates-in-data-frame

survey <-     data.frame(date=c("2012/07/26","2012/07/25"),tx_start=c("2012/01/01","2012/01/01    "))

survey$date_diff <- as.Date(as.character(survey$date), format="%Y/%m/%d")-
                  as.Date(as.character(survey$tx_start), format="%Y/%m/%d")
survey

currently, I have the following code, but it's only giving me the total number of days in Q1, leaving the rest blank. I'm subtracting the begin date from the end date, however after many many hours over the past week, I'm no closer to (or perhaps further away from) figuring out how to populate the new quarter columns.

sick2$Q1 <- if ("/%m/" < 4) {
  as.Date(as.character(sick2$END_DT), format="%d/%m/%Y") -
  as.Date(as.character(sick2$BEGIN_DT), format="%d/%m/%Y")  
} else { "0" }
sick2$Q2 <- if ("/%m/" > 3 & "/%m/" < 7) {
  as.Date(as.character(sick2$END_DT), format="%d/%m/%Y") -
  as.Date(as.character(sick2$BEGIN_DT), format="%d/%m/%Y")
} else { "0" }
sick2$Q3 <- if ("/%m/" > 6 & "/%m/" < 10) {
  as.Date(as.character(sick2$END_DT), format="%d/%m/%Y") -
  as.Date(as.character(sick2$BEGIN_DT), format="%d/%m/%Y")
} else { "0" }
sick2$Q4 <- if ("/%m/" > 9) {
  as.Date(as.character(sick2$END_DT), format="%d/%m/%Y") -
  as.Date(as.character(sick2$BEGIN_DT), format="%d/%m/%Y")
} else { "0" }

Anyway, if anyone could at least point me in the right direction here, I'd v.v.v. much appreciate it, Thanks, Adrian.

Community
  • 1
  • 1
daido
  • 31
  • 5

2 Answers2

0

Not pretty (eww.. loops) but here's my take at this:

require(lubridate)
df <- data.frame(BEGIN_DT = c("04/12/2013", "14/01/2013", "31/12/2012", "24/09/2013", "31/12/2013"),
           END_DT = c("04/12/2013", "15/01/2013", "04/01/2013", "27/09/2013",  "31/12/2013"))
rownames(df) <- c(16:20)
df$BEGIN_DT <- as.Date(df$BEGIN_DT, "%d/%m/%Y")
df$END_DT <- as.Date(df$END_DT, "%d/%m/%Y")

for(i in 1:5){
  diff_time <- as.numeric(df$END_DT[i] - df$BEGIN_DT[i])
  date_vec <- df$BEGIN_DT[i]
  for(j in 1:diff_time){
    if(diff_time > 0) {
    date_vec <- c(date_vec, df$BEGIN_DT[i] + j)
    }
  }
  date_vec <- as.character(quarters(date_vec))
  df$Q1[i] <- length(date_vec[date_vec %in% "Q1"])
  df$Q2[i] <- length(date_vec[date_vec %in% "Q2"])
  df$Q3[i] <- length(date_vec[date_vec %in% "Q3"])
  df$Q4[i] <- length(date_vec[date_vec %in% "Q4"])
}

First thing to do is accumulate vector of all dates included in the interval, then check it against lubridate quarters() and then count occurences. Can't think of a way to do it without loops. Maybe someone else can?

statespace
  • 1,644
  • 17
  • 25
  • really appreciate the thought put in A.Val. thanks.. Playing around with your lovely loops now :) I'm taking baby steps, and sure will update where I am again soon enough. – daido Mar 05 '15 at 15:25
0

Not really proud of this, but might give you some direction. There's got to be a better way...

library(lubridate)
library(plyr)

df <- data.frame(BEGIN_DT = c("04/12/2013",
                               "14/01/2013",
                               "31/12/2012",
                               "24/09/2013",
                               "31/12/2013"),
                 END_DT = c("04/12/2013",
                             "15/01/2013",
                             "04/01/2013",
                             "27/09/2013",
                             "31/12/2013"))

df$BEGIN_DT <- dmy(df$BEGIN_DT)
df$END_DT <- dmy(df$END_DT)

date_seq <- list()
for (i in seq_along(df$BEGIN_DT)) {
  date_seq[[i]] <- seq(from = df$BEGIN_DT[i], 
                       to   = df$END_DT[i], 
                       by   = "1 day")
}

# sapply(date_seq, quarter)
sapply(sapply(date_seq, quarters), table)
results <- sapply(sapply(date_seq, quarters), table)

ldply(results rbind)
#   Q4 Q1 Q3
# 1  1 NA NA
# 2 NA  2 NA
# 3  1  4 NA
# 4 NA NA  4
# 5  1 NA NA

cbind(df, ldply(results, rbind))
#     BEGIN_DT     END_DT Q4 Q1 Q3
# 1 2013-12-04 2013-12-04  1 NA NA
# 2 2013-01-14 2013-01-15 NA  2 NA
# 3 2012-12-31 2013-01-04  1  4 NA
# 4 2013-09-24 2013-09-27 NA NA  4
# 5 2013-12-31 2013-12-31  1 NA NA

Alternate Solution with library(magrittr)

Still requires date_seq list object

library(magrittr)

sapply(date_seq, quarters) %>%
  sapply(., factor, levels=c("Q1", "Q2", "Q3", "Q4")) %>%
  sapply(., table) %>%
  t(.) %>%
  cbind(df, .)

#     BEGIN_DT     END_DT Q1 Q2 Q3 Q4
# 1 2013-12-04 2013-12-04  0  0  0  1
# 2 2013-01-14 2013-01-15  2  0  0  0
# 3 2012-12-31 2013-01-04  4  0  0  1
# 4 2013-09-24 2013-09-27  0  0  4  0
# 5 2013-12-31 2013-12-31  0  0  0  1
JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116
  • thanks Jason (also for the edit suggestion). I'm surprised at ye folk wondering about better ways - makes me worry about this project I've taken on :p Again, I'll update ye with my progress as soon as I get somewhere worth reporting – daido Mar 05 '15 at 15:58