1

I have a database with a 142 columns with one called "Date" (of class POSIXct) that I'd like to make a new column from that groups consecutive dates together. Dates with more than 2 days separating one another are categorized into separate groups.

I'd also like to name the level of the group with the name of month the consecutive dates start in (For example: Jan. 3rd, 2018 -> Jan. 12th 2018 = group level called "January sampling event"; Feb 27th, 2018 -> March 1st, 2018 = group level called "February sampling event"; etc...).

I've seen very similar questions like Group consecutive dates in R and R: group dates that are next to each other, but just can't get it to work for my data.

EDIT: My data example (Last row shows dates separated by over a year are grouped together, for some reason)

    > dput(df)
structure(list(Date = structure(c(17534, 17535, 17536, 17537, 
18279, 18280, 18281, 18282, 17932), class = "Date"), group = c(1, 
1, 1, 1, 2, 2, 2, 2, 2)), row.names = c(NA, -9L), class = c("tbl_df", 
"tbl", "data.frame"))

My attempt:

df$group <- 1 + c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))
Nate
  • 411
  • 2
  • 10
  • Please make a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) or [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) with a sample input (not posted in an image, consider using `dput()`) and your expected output. So we can understand your problem and think about a possible solution and verify it compared to your expected output. – Martin Gal Apr 15 '22 at 22:04
  • I recently added a small version of my data. – Nate Apr 16 '22 at 01:42

1 Answers1

2

Remove time from date time

It's hard to tell exactly what the problem is without seeing your data (or similar example data), but my guess is that the date time format (the 00:00:00 part) is messing up as.Date

One solution would be to extract just the date part and then try again with just the date part:

# here are your date times
date_time <- "2018-01-03 00:00:00"

# this looks for 4 digits between 0 and 9, followed by a dash, followed by 2 digits between 0 and 9,followed by a dash, followed by 2 digits between 0 and 9  
date_pattern <- " ?([0-9]{4}-[0-9]{2}-[0-9]{2}) ?"

#need this library
library(stringr)
library(magrittr) #for pipes

#this pulls out text matching the pattern we specified in date pattern
date_new <- str_extract(date_time, date_pattern) %>% 
  str_squish()   # this removes white space

# this is the new date without the time
date_new

# then we convert to as date
date_new <- as.Date(date_new)

See if converting your date column to just dates and then rerunning your grouping works.

If you have dates in different formats and need to adapt the regular expression, here's something about regular expressions: https://stackoverflow.com/a/49286794/16502170

Group dates

Let's start with an example data frame that contains a date column

# here's a bunch of example dates:
library(lubridate)
dates2 <- seq.Date(as.Date("2018-03-01"),by="days",length.out = 60)

#here's the dataframe
exampl_df <- data.frame(animals = rep(c("cats","dogs","rabbits"),20), dates=dates2,
                        numbers= rep(1:3,20))

Here's what it looks like:

head(exampl_df)
  animals      dates numbers
1    cats 2018-03-01       1
2    dogs 2018-03-02       2
3 rabbits 2018-03-03       3
4    cats 2018-03-04       1
5    dogs 2018-03-05       2
6 rabbits 2018-03-06       3

Then let's make a sequence of every day between the minimum and maximum date in the sequence. This step is important because there may be missing dates in our data that we still want counting towards the separation between days.

# this is a day by day sequence from the earliest day in your data to the latest day
date_sequence <- seq.Date(from = min(dates2),max(dates2),by="day")

Then let's make a sequence of numbers each repeated seven times. If you wanted to group every three days, you could change each to 3. Then the length.out= length(date_sequence) tells R to make this vector have as many entries as the min to max date sequence has:

# and then if you want a new group every seven days you can make this number sequence
groups <- rep(1:length(date_sequence),each= 7, length.out = length(date_sequence) )

Then let's attach the groups to the date_sequence to make a grouping index

date_grouping_index <- data.frame(a=date_sequence,b=groups)

then you can do a join to attach the groups to the original dataframe

library(dplyr)
example_df 2 <- exampl_df %>% 
  inner_join(date_grouping_index, by=c("dates"="a"))

This is what we get:

head(example_df2,n=10)
   animals      dates numbers b
1     cats 2018-03-01       1 1
2     dogs 2018-03-02       2 1
3  rabbits 2018-03-03       3 1
4     cats 2018-03-04       1 1
5     dogs 2018-03-05       2 1
6  rabbits 2018-03-06       3 1
7     cats 2018-03-07       1 1
8     dogs 2018-03-08       2 2
9  rabbits 2018-03-09       3 2
10    cats 2018-03-10       1 2

Then you should be able to group_by() or aggregate() your data using column b

Using the data provided in the question

#original data
df <- structure(list(Date = structure(c(17534, 17535, 17536, 17537, 
                                        18279, 18280, 18281, 18282, 17932), class = "Date"), group = c(1, 
                                                                                                     1, 1, 1, 2, 2, 2, 2, 2)), row.names = c(NA, -9L), class = c("tbl_df", 
                                                                                                                                                                   "tbl", "data.frame"))

#plus extra step
df$group2 <- 1 + c(0, cumsum(ifelse(diff(df$Date) > 1, 1, 0)))

Method described above

date_sequence <- seq.Date(from = min(df$Date),max(df$Date),by="day")
groups <- rep(1:length(date_sequence),each= 7, length.out = length(date_sequence) )
date_grouping_index <- data.frame(a=date_sequence,groups=groups)

example_df2<- df %>% 
  inner_join(date_grouping_index, by=c("Date"="a"))

Looks like it worked?

example_df2
# A tibble: 9 x 4
  Date       group group2 groups
  <date>     <dbl>  <dbl>  <int>
1 2018-01-03     1      1      1
2 2018-01-04     1      1      1
3 2018-01-05     1      1      1
4 2018-01-06     1      1      1
5 2020-01-18     2      2    107
6 2020-01-19     2      2    107
7 2020-01-20     2      2    107
8 2020-01-21     2      2    107
9 2019-02-05     2      2     57

Here's something you could do to make group names with the date and year in them:

example_df2$group_name <- paste0("sampling number ",
                                example_df2$groups,
                                " (",
                                month.name[month(example_df2$Date)],
                                "-",
                                year(example_df2$Date),
                                ")")
Russ
  • 1,385
  • 5
  • 17
  • It nearly worked, but my second bit of code still doesn't know how to group them. It grouped 2018-xx-xx to the next date 2013-xx-xx, probably because it's only looking at the day part...not sure. – Nate Apr 16 '22 at 01:24
  • The date is weird too, because I don't see the time part in the df. It only appears when I posted the first few rows of data here with head(). – Nate Apr 16 '22 at 01:27
  • I think I see the problem. The dates are not in equally divided into "batches"; some are 10 days in a row, some are 7. To give some background they are days out in the field, so it varies greatly by year and month. I'll accept the answer, but if there's a way to do this with a kind of random number of dates in a row, that's what I'm aiming for. Thanks for the help! – Nate Apr 16 '22 at 03:16
  • If I'm understanding what you want, you could probably create group names based off of month and year using `paste(month.name[month(example_df2$Date)], year(example_df2$Date, sep = "_")` similar to the group names thing at the very end. And then aggregate or group_by or whatever. Or some custom length using `seq.Date()` it accepts all sorts of things for by such as `by="2 days"` – Russ Apr 16 '22 at 03:23
  • I'm hoping to create a new factor (or character string) column in my dataframe that helps me identify dates that occur in sequence in my "date" column. Sometimes the dataframe indicates 5 days in a row, sometimes 7, but these dates could be months apart. For example, Jan. 1st to Jan.15th could be group 1, and group 2 could be March 18th to April 1st. Does this help? – Nate Apr 16 '22 at 03:33
  • Dates with the same month and days in sequence (1,2,3,4..), but different years shouldn't be grouped together, for example. – Nate Apr 16 '22 at 03:34
  • Sorry, I meant to say these groups associated with the sequential dates could be months/years apart. – Nate Apr 16 '22 at 03:40
  • 2022-01-01 and 2022-01-02 are in the same group, but 2022-01-01 and 2021-01-02 are not and neither are 2022-01-01 and 2022-01-03 (because at least 1 day separates them). – Nate Apr 16 '22 at 03:44