1

I have a table as shown.

df <- data.frame("name" = c("jack", "william", "david", "john"), 
                 "01-Jan-19" = c(NA,"A",NA,"A"),
                 "01-Feb-19" = c("A","A",NA,"A"),
                 "01-Mar-19" = c("A","A","A","A"),
                 "01-Apr-19" = c("A","A","A","A"),
                 "01-May-19" = c(NA,"A","A","A"),
                 "01-Jun-19" = c("A","SA","A","SA"),
                 "01-Jul-19" = c("A","SA","A","SA"),
                 "01-Aug-19" = c(NA,"SA","A","SA"),
                 "01-Sep-19" = c(NA,"SA","A","SA"),
                 "01-Oct-19" = c("SA","SA","A","SA"),
                 "01-Nov-19" = c("SA","SA",NA,"SA"),
                 "01-Dec-19" = c("SA","SA","SA",NA),
                 "01-Jan-20" = c("SA","M","A","M"),
                 "01-Feb-20" = c("M","M","M","M"))

Over a time period, each person journeys through of position progression (3 position categories from A to SA to M). My objective is: Calculate the average duration of A (assistant) position and SA (senior assistant) position. i.e. the duration between the date the first of one category appears, and the date the last of this category appears, regardless of missing data in between. I transposed the data using R “gather” function

df1 <- gather (df, "date", "position", 2:15)

then I am not sure how to best proceed. What might be the best way to further approach this?

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Almond
  • 87
  • 1
  • 1
  • 5

1 Answers1

1

We can get the data in longer format and calculate the number of days between first date when the person was "SA" and the first date when he was "A".

library(dplyr)

df %>%
  tidyr::pivot_longer(cols = -name, names_to = 'person', values_drop_na = TRUE) %>%
  mutate(person = dmy(person)) %>%
  group_by(name) %>%
  summarise(avg_duration = person[match('SA', value)] - person[match('A', value)]) 

#  name    duration
#  <fct>   <drtn>  
#1 david   275 days
#2 jack    242 days
#3 john    151 days
#4 william 151 days

If needed the mean value we can pull and then calculate mean by adding to the above chain

%>% pull(duration) %>% mean
#Time difference of 204.75 days

data

df <- structure(list(name = c("jack", "william", "david", "john"), 
`01-Jan-19` = c(NA, "A", NA, "A"), `01-Feb-19` = c("A", "A", 
NA, "A"), `01-Mar-19` = c("A", "A", "A", "A"), `01-Apr-19` = c("A", 
"A", "A", "A"), `01-May-19` = c(NA, "A", "A", "A"), `01-Jun-19` = c("A", 
"SA", "A", "SA"), `01-Jul-19` = c("A", "SA", "A", "SA"), 
`01-Aug-19` = c(NA, "SA", "A", "SA"), `01-Sep-19` = c(NA, 
"SA", "A", "SA"), `01-Oct-19` = c("SA", "SA", "A", "SA"), 
`01-Nov-19` = c("SA", "SA", NA, "SA"), `01-Dec-19` = c("SA", 
"SA", "SA", NA), `01-Jan-20` = c("SA", "M", "A", "M"), `01-Feb-20` = c("M", 
"M", "M", "M")), row.names = c(NA, -4L), class = "data.frame")
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Hi Ronak, I copied your code to execute, but get this error message:avg_duration 1 NA days Warning message: All formats failed to parse. No formats found. – Almond Feb 26 '20 at 07:23
  • @Almond probably, your system language is not English. Change it to English using https://stackoverflow.com/questions/16347731/how-to-change-the-locale-of-r and try again – Ronak Shah Feb 26 '20 at 07:24
  • result of Sys.getlocale() [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252" – Almond Feb 26 '20 at 07:35
  • @Almond ok..then replace the `mutate` line with `mutate(person = dmy(sub('X', '', person))) %>%` keep rest of the code as it is if you are using the data from your post. – Ronak Shah Feb 26 '20 at 07:49
  • Thanks! get it-remove the X at the front. I only get one of the averages, may I ask how to display all the individual result? (avg_duration 1 242 days) – Almond Feb 26 '20 at 08:06
  • What do you mean by individual results? Result for only "jack"? You can add `%>% filter(name == 'jack')`. – Ronak Shah Feb 26 '20 at 08:07
  • I could only get Jack's results (242 days) displayed (without filtering the name). But you have all four people's results displayed. very strange, as I carbon copied the coding. – Almond Feb 27 '20 at 01:16
  • @Almond maybe you have `plyr` loaded in your environment. They might be masking some of the `dplyr` functions. Restart your R session and load only `dplyr` and then check again or explicitly use `dplyr::mutate` and `dplyr::summarise`. – Ronak Shah Feb 27 '20 at 01:28
  • Many thank!!! it worked! it was due to loading plyr and dplyr at the same time. Can I have one more question: I need to calculate the 1st A to the last A, instead of 1st A to 1st SA, how can I revise the code? I attempted tail function, but it did not work. – Almond Feb 27 '20 at 02:11
  • you can change the last line of `summarise` to `summarise(avg_duration = person[max(which(value == 'A'))] - person[min(which(value == 'A'))]) ` – Ronak Shah Feb 27 '20 at 02:20
  • Thank you so very much! extraordinary help with this! Thank you for your time. – Almond Feb 27 '20 at 02:25
  • With the same dataset. May I ask how to modify the codes for the following purpose? of periods in between the two As, how can I minus the period that are with other values (anything that is not A, e.g. SA, NA)? Thank you so much! – Almond Jun 24 '20 at 05:57
  • @Almond Probably it is better to ask new question instead of extending the question in comments. It is confusing for future readers to understand the exact problem. – Ronak Shah Jun 24 '20 at 06:15
  • Thank you very much! I will do as you suggested. – Almond Jun 24 '20 at 06:19