1

I have a dataset, df:

Ultimately I would like to be able to group the data into 'chunks' where the Folder column contains the string 'Out', making sure to consider the DATE that it is associated with. Is there a way to create a chunk for each instance that 'Out' occurs, while computing its duration.

  Folder               DATE
  Out                  9/9/2019 5:46:00                
  Out                  9/9/2019 5:46:01
  Out                  9/9/2019 5:46:02
  In                   9/9/2019 5:46:03
  In                   9/9/2019 5:46:04
  Out                  9/10/2019 6:00:01
  Out                  9/10/2019 6:00:02
  In                   9/11/2019 7:50:00
  In                   9/11/2019 7:50:01

I would like this output:

 New Variable        Duration
 Out1                 2 sec
 Out2                 1 sec

I have included the dput:

structure(list(Folder = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 
1L, 1L), .Label = c("In", "Outdata"), class = "factor"), Date =  structure(c(3L, 
3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L), .Label = c("9/10/2019 6:00", 
"9/11/2019 7:50", "9/9/2019 5:46"), class = "factor")), class = "data.frame", row.names = c(NA,      
-9L))

What I have tried so far:

 #Loading appropriate libraries

 library(dplyr)
 library(lubridate)

creating new variable that will first group the folder by the string 'Out'

(However, this is where I am not sure what to do because I wish to create a new variable for each 'Out' group and its time duration to ultimately plot this on a histogram.)

 newdf<-df %>%
 group_by(df$Folder) %>%            
 summarise(mutate(Duration = difftime(as.POSIXct(ss_EndTime, format = 
 "%m/%d/%Y %I:%M:%S %p"),as.POSIXct(ss_StartTime, 
 format = "%m/%d/%Y %I:%M:%S %p" ), units = "secs")))

I will continue researching, all suggestions are appreciated.

Lynn
  • 4,292
  • 5
  • 21
  • 44
  • 2
    Don't format everything as code. Please only use four space to make a block formatted as code. You had way too many spaces, pushing the code unnecessarily to left. You are sharing output of `str()` not `dput`. As you are asking questions frequently, please read this in details: providing a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). This comment is valid for many other questions that you've asked. – M-- Jan 29 '20 at 23:32
  • ok I will update to dput. – Lynn Jan 29 '20 at 23:37

2 Answers2

1

We can convert the DATE column to POSIXct, create a grouping column based on consecutive occurrence of Folder values, filter only "Out" values and get the difference between first and last time in each group.

library(dplyr)

df %>%
  mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M:%S"), 
         gr = cumsum(Folder != lag(Folder, default = TRUE))) %>%
   filter(Folder == "Out") %>%
   arrange(gr, DATE) %>%
   group_by(gr) %>%
   summarise(Duration = difftime(last(DATE), first(DATE))) %>%
   mutate(gr = paste0('Out', row_number()))

# A tibble: 2 x 2
#  gr    Duration
#  <chr> <drtn>  
#1 Out1  2 secs  
#2 Out2  1 secs  

data

df <- structure(list(Folder = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 
1L, 1L), .Label = c("In", "Out"), class = "factor"), DATE = structure(c(5L, 
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L), .Label = c("9/10/2019 6:00:01", 
"9/10/2019 6:00:02", "9/11/2019 7:50:00", "9/11/2019 7:50:01", 
"9/9/2019 5:46:00", "9/9/2019 5:46:01", "9/9/2019 5:46:02", "9/9/2019 5:46:03", 
"9/9/2019 5:46:04"), class = "factor")), class = "data.frame", 
row.names = c(NA, -9L))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • This worked well! Can you please explain what the g r= is doing? – Lynn Jan 29 '20 at 23:59
  • 1
    @TanishaHudson `gr` create groups for consecutive occurrence of values in `Folder`. Run only `df %>% mutate(gr = cumsum(Folder != lag(Folder, default = TRUE)))` to understand how the groups are created. – Ronak Shah Jan 30 '20 at 00:08
  • Ok I will. Thank you for your shared knowledge on the subject! – Lynn Jan 30 '20 at 00:21
  • Hi @Ronak, my DateTime is in this format here: 9/9/2019 5:46:40 PM and I getting this error as I am working with my full dataset. Error in as.POSIXlt.character(as.character(x), ...) : character string is not in a standard unambiguous format How can I overcome this error? I tried to convert, but I just learned I may not be able to convert this type – Lynn Jan 30 '20 at 01:30
  • 1
    @TanishaHudson Change the date line in `mutate` to `mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %I:%M:%S %p")` – Ronak Shah Jan 30 '20 at 01:34
  • This worked! I am also working with a large dataset here and some values are coming out as 30, but some come out as 20.777, I wonder if these are minutes or seconds or both? Is there a way to assure all the times are the same? Should I add units=secs somewhere? – Lynn Jan 30 '20 at 01:53
  • 1
    @TanishaHudson Yes, use `difftime(last(DATE), first(DATE), units = "secs")` or `difftime(last(DATE), first(DATE), units = "mins")` based on how you want output. – Ronak Shah Jan 30 '20 at 01:54
1

We can do this with lubridate/rleid

library(dplyr)
library(lubridate)
library(data.table)
library(stringr)
df %>% 
    mutate(DATE = mdy_hms(DATE), grp = rleid(Folder)) %>% 
    filter(Folder == "Out") %>%
    group_by(grp) %>% 
    summarise(Duration = diff(range(DATE))) %>% 
    mutate(grp = str_c("Out", row_number()))
# A tibble: 2 x 2
#  grp   Duration
#  <chr> <drtn>  
#1 Out1  2 secs  
#2 Out2  1 secs  

Or using the same with data.table

setDT(df)[, grp := rleid(Folder)][Folder == 'Out',
    .(Duration = diff(range(mdy_hms(DATE)))), .(grp = paste0("Out", grp))]

data

df <- structure(list(Folder = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 
1L, 1L), .Label = c("In", "Out"), class = "factor"), DATE = structure(c(5L, 
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L), .Label = c("9/10/2019 6:00:01", 
"9/10/2019 6:00:02", "9/11/2019 7:50:00", "9/11/2019 7:50:01", 
"9/9/2019 5:46:00", "9/9/2019 5:46:01", "9/9/2019 5:46:02", "9/9/2019 5:46:03", 
"9/9/2019 5:46:04"), class = "factor")), class = "data.frame", 
row.names = c(NA, -9L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Hi @akrun, this works as well. What is ‘rleid’, does it assist in grouping? Thank you – Lynn Jan 30 '20 at 00:31
  • 1
    @TanishaHudson It would be much faster and error proof from data.table and is easier to create groupings i.e. `rleid(c('a', 'b', 'a', 'a', 'b', 'c', 'c'))` – akrun Jan 30 '20 at 00:32
  • 1
    ah I see! I will see which is best – Lynn Jan 30 '20 at 01:00
  • I am wondering in order to convert,do I need to specify an origin? as.POSIXct(as.numeric(as.character(df$DATE)),origin="2019-09-09") I have this error in my own code Error in as.POSIXlt.character(as.character(x), ...) : character string is not in a standard unambiguous format – Lynn Jan 30 '20 at 01:24
  • 1
    @TanishaHudson `as.numeric` will only work if the values are numeric internally and if it is `"9/11/2019 7:50:00"` it won't work. Also, check the values that aree going into `as.POSIXct` i.e. if I do `as.numeric(Sys.time())#[1] 1580347561`, it can be converted back as `as.POSIXct(as.numeric(Sys.time()), origin = '1970-01-01')# [1] "2020-01-29 20:27:01 EST"` – akrun Jan 30 '20 at 01:27
  • 1
    Hm ok I am still trying to understand this. Not sure what to do. I will re-read your notes and work on this – Lynn Jan 30 '20 at 01:32
  • 1
    @TanishaHudson Glad to help you. If you have more questions, please do post as a new question. It will help in better understanding of the issue – akrun Jan 30 '20 at 01:44
  • 1
    @TanishaHudson I didn't meant in a bad way. It is just that it may be easier to understand for you and others. Sorry! – akrun Jan 30 '20 at 01:55