5

I have a set of data that looks like this:

          Date boolean
407 2006-06-01       1
408 2006-06-02       1
409 2006-06-03       1
410 2006-06-04      NA
411 2006-06-05       0
412 2006-06-06       1
413 2006-06-07       1
414 2006-06-08       0
415 2006-06-09       1

From this, I am trying to create a new data frame that will show me what dates my runs of 1's occur as well as how long these runs are, with the column headers: 1) start date, 2) end date, and 3) length of run.

Ultimately, I want to create a data frame that looks like this from the data I have above:

  Start Date   End Date  Length of Run
1 2006-06-01 2006-06-03              3
2 2006-06-06 2006-06-07              2  

I have a few NA's in my data that I need to ignore throughout my data as well.

josliber
  • 43,891
  • 12
  • 98
  • 133
lg929
  • 234
  • 5
  • 13
  • I noticed you didn't include the last run of a single day -- do you only want to include a row of output if there are multiple days with `boolean` set to 1? This is what I assumed when writing up my response. – josliber Feb 02 '16 at 20:10

3 Answers3

4

You could do this with dplyr, using mutate to convert missing boolean values to 0, group_by to compute groups with constant values of variable boolean, filter to limit to groups where boolean was set to 1 and where the group had more than one member, and then summarize to grab the relevant summary information. (I take a few extra steps to remove the grouping variable at the end).

library(dplyr)
dat %>%
  mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
  group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
  filter(boolean == 1 & n() > 1) %>%
  summarize("Start Date"=min(as.character(Date)),
            "End Date"=max(as.character(Date)),
            "Length of Run"=n()) %>%
  ungroup() %>%
  select(-matches("group"))
#   Start Date   End Date Length of Run
#        (chr)      (chr)         (int)
# 1 2006-06-01 2006-06-03             3
# 2 2006-06-06 2006-06-07             2

Data:

dat <- read.table(text="          Date boolean
407 2006-06-01       1
408 2006-06-02       1
409 2006-06-03       1
410 2006-06-04      NA
411 2006-06-05       0
412 2006-06-06       1
413 2006-06-07       1
414 2006-06-08       0
415 2006-06-09       1", header=T)
josliber
  • 43,891
  • 12
  • 98
  • 133
  • When I run this, I have an error that states: Error in UseMethod("ungroup") : no applicable method for 'ungroup' applied to an object of class "NULL". What should I do to make sure it's only ungrouping complete cases? @josilber – lg929 Feb 02 '16 at 20:14
  • @Ig929 -- the `ungroup` and `select` lines are purely cosmetic -- what do you get when you run the rest of the code? I have updated my answer to include the exact data frame I'm using -- if you're not able to replicate it on your own data frame you will probably need to include more details in your question (e.g. the output of `dput` applied on your data). – josliber Feb 02 '16 at 20:15
  • For my actual dataset, it is also including all of the runs with no days in between, for instance, it is calling 2006-07-05 to 2006-07-05 (0 days in between) and is labeling it as a length of run of "1" @josilber. But I can remove this with an additional step. – lg929 Feb 02 '16 at 20:21
  • @Ig929 if you want to include runs of length 1, just remove the code `& n() > 1` from the `filter` call. – josliber Feb 02 '16 at 20:28
  • This won't work if an `NA` boolean value is adjacent to 1's on both sides. `dat[complete.cases(dat),]` will collapse the separate groups into one long group. – Pierre L Feb 02 '16 at 20:36
  • @PierreLafortune ah, interesting -- thanks for pointing that out! I had assumed when the OP said "I have a few NA's in my data that I need to ignore throughout my data as well" they meant they wanted to remove those observations, but if an NA between runs of 1s splits them into separate runs then the correct action would be to replace NAs with 0s instead of removing those rows. I'm not sure which behavior the OP desires but that should be a pretty minor adjustment to do the replacement instead of the removal. – josliber Feb 02 '16 at 20:39
  • @PierreLafortune, I'm not sure I understand? So if I have NA, 1, NA, then this would be considered a run of 3? – lg929 Feb 02 '16 at 20:50
  • For the above example, in the `boolean` column replace the `0` in row 5 with a `1`. – Pierre L Feb 02 '16 at 20:51
  • You will see that the two groups will now be treated as one. – Pierre L Feb 02 '16 at 20:53
  • @lg929 The point that's unclear is whether 1, 1, NA, 1, 1 is considered two runs of 2, a single run of 4, or (unlikely?) a run of 5. My current solution treats it as a run of 4. – josliber Feb 02 '16 at 20:53
  • @josilber Thank you for clarifying. Yes, I should have been more specific in stating that when an NA splits a run of 1's, it should not be entirely ignored in that it would break this run, ie be considered a 0. – lg929 Feb 04 '16 at 22:33
  • @lg929 OK, I've updated the code to split up runs of 1 when there is an NA in between. – josliber Feb 04 '16 at 22:40
2

We can also use data.table to subset and cast the data as needed. First we create an id column with rleid(boolean). Next, subset the data according to the necessary conditions. Lastly, we create start, end, and run with the subsetted data:

library(data.table)
setDT(dat)[,id := rleid(boolean)][
  ,.SD[.N > 1 & boolean == 1],id][
  ,.(start=Date[1],end=Date[.N], run=.N),id]
#   id      start        end run
#1:  1 2006-06-01 2006-06-03   3
#2:  4 2006-06-06 2006-06-07   2
Pierre L
  • 28,203
  • 6
  • 47
  • 69
1

Another answer using base, reformatting this answer's use of cumsum and diff.

#Remove ineligible dates (defined by 0 or NA)
x1 <- x[!(x$boolean %in% c(NA, 0)), ]

x1$Date <- as.Date(x1$Date)  #Convert date from factor to Date class

#Starting at 0, if the difference between eligible dates is >1 day, 
#   add 1 (TRUE) to the previous value, else add 0 (FALSE) to previous value
#This consecutively numbers each series
x1$SeriesNo <-  cumsum(c(0, diff(x1$Date) > 1))

#          Date boolean SeriesNo
#407 2006-06-01       1        0
#408 2006-06-02       1        0
#409 2006-06-03       1        0
#412 2006-06-06       1        1
#413 2006-06-07       1        1
#415 2006-06-09       1        2

# Aggregate: Perform the function FUN on variable Date by each SeriesNo group
x2 <-  as.data.frame(as.list(
         aggregate(Date ~ SeriesNo, data= x1, FUN=function(zz) 
         c(Start = min(zz), End= max(zz), Run = length(zz) ))
       )) #see note after this code block

#Output is in days since origin.  Reconvert them into Date class
x2$Date.Start <- as.Date(x2$Date.Start, origin = "1970-01-01")
x2$Date.End   <- as.Date(x2$Date.End,   origin = "1970-01-01")

#  SeriesNo Date.Start   Date.End Date.Run
#1        0 2006-06-01 2006-06-03        3
#2        1 2006-06-06 2006-06-07        2
#3        2 2006-06-09 2006-06-09        1

A note on "buggy" output from aggregate: Using aggregate to apply several functions on several variables in one call

Community
  • 1
  • 1
oshun
  • 2,319
  • 18
  • 32