0

I have taken a function from this post to create a random time efficiently generate a random sample of times and dates between two dates.

Here is my data set:

latemail <- function(N, st="2012/01/01", et="2012/12/31") {
       st <- as.POSIXct(as.Date(st))
       et <- as.POSIXct(as.Date(et))
       dt <- as.numeric(difftime(et,st,unit="sec"))
       ev <- sort(runif(N, 0, dt))
       rt <- st + ev
 }
 set.seed(42); 
 a<- print(latemail(9))  
 a<- sort(a)

data <- data.frame(time= a, place=c("Start", "B", "O", "A", "A", "Start", "A", "O", "A"), ID=c(rep(1, 5), rep(2,4)))

The data looks like this:

                  time place ID
1  2012-02-19 04:40:45 Start  1
2  2012-04-14 12:34:56     B  1
3  2012-07-08 13:16:49     O  1
4  2012-08-22 07:41:26     A  1
5  2012-08-27 21:15:08     A  1
6  2012-09-14 10:22:03 Start  2
7  2012-09-25 22:30:49     B  2
8  2012-10-30 03:43:16     B  2
9  2012-11-29 22:42:03     O  2

I would like to take the time difference when place is "O" and "start", within each group (ID).

Questions: 1) Is the structure of the above data in accordance with tidy data? Because I think it makes more sense to spread the data so one can take difftime column wise. If each ID has only one row will it be tidy data (to separate e.g. between the A's one could call them A_1, A_2 if they have to be columns). But which format is tidy data.
2) Is there a better way to do accomplish this than bellow?

  data2 <- data %>% 
  filter(place %in% c("Start", "O")) %>% 
  group_by(ID) %>% 
  mutate(diff=difftime(lead(time), time, units="days")) %>% 
  filter(!is.na(diff))

Output:

# A tibble: 2 x 4
# Groups:   ID [2]
  time                place    ID diff            
  <dttm>              <fct> <dbl> <time>          
1 2012-02-19 04:40:45 Start     1 140.31671 days  
2 2012-09-25 22:30:49 Start     2 " 65.04947 days"
xhr489
  • 1,957
  • 13
  • 39

1 Answers1

1

We can keep the structure as it is but simplify the code a bit by using summarise (assuming you only have one "O" and "Start" for each ID.

library(dplyr)

data %>%
  group_by(ID) %>%
  summarise(diff = difftime(time[place == "O"], time[place == "Start"]))


#     ID diff            
#  <dbl> <time>          
#1     1 140.31671 days  
#2     2 " 65.04947 days"

If there are some ID's which do not have either "Start" or "O" we can return NA for them

data %>%
  group_by(ID) %>%
  summarise(diff = if (any(place == "O") & any(place == "Start"))
                   difftime(time[place == "O"], time[place == "Start"]) else NA)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks, that is more elegant! Could you please also comment regarding "tidy data". I have edited my question a bit. – xhr489 Nov 28 '18 at 14:40
  • @David I don't think there is any need to `spread` the data here. You want to take subtraction from only two values here in each group which is possible using `summarise`. Making the data wide would add unnecessary complications. – Ronak Shah Nov 28 '18 at 14:43
  • Okay thanks. With my real data set I get an error `Column diff must be length 1 (a summary value), not 0`. I think maybe because some ID's dont have an "O". Do you have a solution or should I post a new question? – xhr489 Nov 28 '18 at 14:53
  • @David If there is is no "O" for a group what should be the output? – Ronak Shah Nov 28 '18 at 14:55
  • @Roank Shah: I guess NA or some text saying that (e.g. no "O"). – xhr489 Nov 28 '18 at 15:01
  • Hi again, so now I get this error: ` Column `diff` must be length 1 (a summary value), not 2`. I guess that in some groups I have more than one row with "O". How can I return something (other than NA) so I can distinguish this situation. – xhr489 Nov 29 '18 at 10:24
  • @David in that case use `which.max` to get the first match. `data %>% group_by(ID) %>% summarise(diff = if (any(place == "O") & any(place == "Start")) difftime(time[which.max(place == "O")], time[which.max(place == "Start")]) else NA)` – Ronak Shah Nov 29 '18 at 13:19