3

I am trying write a function or use cut to assign a grouping variable to some date data when those dates are close (user definition of close). For example, I would like to create a common grouping variable for some samples that were collected on consecutive dates. I was thinking cut would work here but then I realized cut doesn't group variables when they are close and rather creates a series of groups based on a sequence.

So take this dataframe for example:

df <- structure(list(Num = c(0.888401849195361, 0.185766335576773, 
0.493163562379777, 0.13070688676089, 0.484760325402021, 0.603240836178884, 
0.893201333936304, 0.641203448642045, 0.16957180458121, 0.0101411847863346
), Date = structure(c(10592, 10597, 10598, 10605, 10606, 10608, 
10609, 10616, 10617, 10618), class = "Date"), day = c(1L, 6L, 
7L, 14L, 15L, 17L, 18L, 25L, 26L, 27L)), .Names = c("Num", "Date", 
"day"), row.names = c(NA, -10L), class = "data.frame")

If was to apply a cut function as I understand its usage like this:

df$cutVar <- cut(df$day, breaks= seq(0, 31, by = 1), right=TRUE) 

I would be left with a range that went right through values that I'd prefer to be grouped together. For example, the 6th and 7th should be grouped together based on their proximity to each other. Similar to 14th and 15th and so on.

> df
          Num       Date day  cutVar
1  0.88840185 1999-01-01   1   (0,1]
2  0.18576634 1999-01-06   6   (5,6]
3  0.49316356 1999-01-07   7   (6,7]
4  0.13070689 1999-01-14  14 (13,14]
5  0.48476033 1999-01-15  15 (14,15]
6  0.60324084 1999-01-17  17 (16,17]
7  0.89320133 1999-01-18  18 (17,18]
8  0.64120345 1999-01-25  25 (24,25]
9  0.16957180 1999-01-26  26 (25,26]
10 0.01014118 1999-01-27  27 (26,27]

So the basic question here is how to group a continuous variable (a date in this instance) such that close (defined by the user) numbers are grouped together in a factor range?

boshek
  • 4,100
  • 1
  • 31
  • 55
  • For groups of _consecutive_ dates, probably `rle` would be very useful. Other definitions of "closeness" it seems to me will become problematic very quickly due to the transitive property. – joran Jun 20 '16 at 21:25
  • Can you paste in an example of the output you want (or is that the `df` at the end)? Do you just want 2 adjacent rows combined? Eg, what if 1 is 'close' to 2, & 2 is close to 3, ..., 19 is close to 20, but 1 is not remotely 'close' to 20? Will the data already exist in a single unchanging set, or will new data be continually added over time? – gung - Reinstate Monica Jun 20 '16 at 21:26
  • @joran I completely agree with this. However, in my circumstance (which the example data may not capture), this will not be an issue. This is historical sampling data where someone has collected data over the space of one or two days then waited a month, and then did the same thing. Is it possible to outline the usage of `rle`? – boshek Jun 21 '16 at 15:44
  • @boshek I had in mind an approach sort of like [this](http://stackoverflow.com/a/18508644/324364), although it looks like `rle` doesn't turn out to be necessary. (You'd need to operate on the integer representation of your dates.) – joran Jun 21 '16 at 16:06

1 Answers1

2

Is this something you'd like? where 3 is a threshold I chose for convenience. It can be any number you prefer:

df$group <- cumsum(c(1, diff.Date(df$Date)) >= 3)
df
          Num       Date day group
1  0.88840185 1999-01-01   1     0
2  0.18576634 1999-01-06   6     1
3  0.49316356 1999-01-07   7     1
4  0.13070689 1999-01-14  14     2
5  0.48476033 1999-01-15  15     2
6  0.60324084 1999-01-17  17     2
7  0.89320133 1999-01-18  18     2
8  0.64120345 1999-01-25  25     3
9  0.16957180 1999-01-26  26     3
10 0.01014118 1999-01-27  27     3
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • I am not fully understanding the usage here as I am not able to apply it generally. For example if I add another sequence of dates (`Date2`) and create a grouping based on those dates (`Group2`), Those dates - separate by only one day - are not grouped together: `df$Date2 <- structure(c(1118620800, 1118620800, 1118620800, 1118707200, 1118707200, 1118707200, 1118707200, 1118707200, 1118707200, 1118707200), class = c("POSIXct", "POSIXt" ), tzone = "UTC")` `df$Group2 <- cumsum(c(1, diff.Date(df$Date2)) >= 1)`. Any thoughts on how I might be able to rectify this? – boshek Jun 21 '16 at 15:42
  • Because you are choosing the threshold to be `1`, which means you will only group same date. If you want to group dates that have one day difference set the threshold to be higher than one. Similarly, if two days difference is also considered within a group, set the threshold higher than two etc. – Psidom Jun 21 '16 at 15:54
  • Another problem is that before applying the `diff.Date`, you need to convert the `Date` column to `Date` class so that `diff.Date` will return difference in Date instead of seconds. so `df$Date2 <- as.Date(df$Date2)`. – Psidom Jun 21 '16 at 15:59
  • See the above comments. You `Date2` is of `POSIXct` class. – Psidom Jun 21 '16 at 16:00
  • Yes this works quite well. Any idea how to label these any more descriptively? Say for example, instead of an integer, the `Group2` variable printed the first date that sets the threshold? In this example rather than `Group2` printing all zeroes, we print `2005-06-13' which would be an appropriate label in that group? – boshek Jun 21 '16 at 16:09
  • 1
    Try something like df %>% group_by(group) %>% mutate(new_name=Date[1]). – Psidom Jun 21 '16 at 16:19