Two basic approaches: (1) do a yearly sequence, conditioning on whether the start is before and if the end is after June (06
), used in the base R approach; (2) do a monthly sequence, conditioning (group-filtering) each year on whether "06"
is in one of the months. They both work, which you choose depends on comfort/preference.
base R
Since this example uses "sequence by year", I'll use the fact that POSIXlt
(notice the l
, as opposed to the more common POSIXct
) is really a list with elements for year, month, etc:
dput(as.POSIXlt(Sys.time()))
# structure(list(sec = 21.7977600097656, min = 30L, hour = 8L,
# mday = 2L, mon = 5L, year = 123L, wday = 5L, yday = 152L,
# isdst = 1L, zone = "EDT", gmtoff = -14400L), class = c("POSIXlt",
# "POSIXt"), tzone = c("", "EST", "EDT"))
With that, we can easily $
-index the list and compare the mon
th as a value. NOTE: ?POSIXlt
shows that mon
is ‘mon’ 0-11: months after the first of the year
, meaning that June is 5 in 0-based months. (Ergo the use of 5
below.)
fun <- function(sd, ed) {
sdlt <- as.POSIXlt(sd)
edlt <- as.POSIXlt(ed)
if (sdlt$mon > 5) sdlt$year <- sdlt$year+1
if (edlt$mon < 5) edlt$year <- edlt$year-1
if (sdlt <= edlt) unique(format(seq.Date(as.Date(sdlt), as.Date(edlt), by = "year"), format="%Y"))
}
years <- Map(fun, mydata$startdate, mydata$enddate)
str(years)
# List of 5
# $ : chr [1:5] "2008" "2009" "2010" "2011" ...
# $ : chr [1:4] "2008" "2009" "2010" "2011"
# $ : chr "2012"
# $ : chr "2012"
# $ : NULL
out <- data.frame(
id = rep(mydata$id, lengths(years)),
statenumber = rep(mydata$statenumber, lengths(years)),
year = unlist(years))
out
# id statenumber year
# 1 R007 1 2008
# 2 R007 1 2009
# 3 R007 1 2010
# 4 R007 1 2011
# 5 R007 1 2012
# 6 R008 2 2008
# 7 R008 2 2009
# 8 R008 2 2010
# 9 R008 2 2011
# 10 R008 3 2012
# 11 R009 4 2012
If there's a chance to have duplicates (repeated states for an id that gap within the same year), then you can use unique(out)
.
Map
just "zips" the data together. The individual calls to fun
unrolled would look like this:
list(
fun(mydata$startdate[1], mydata$enddate[1]),
fun(mydata$startdate[2], mydata$enddate[2]),
fun(mydata$startdate[3], mydata$enddate[3]),
fun(mydata$startdate[4], mydata$enddate[4]),
fun(mydata$startdate[5], mydata$enddate[5])
)
The inclusion of ed
on the end of c(seq.Date(..), ed)
is to guard against the fact that seq
may not include the year of the enddate
. In this case, it ensures that R008
in state 2
sees 2012
.
dplyr
In this (and the data.table
) section, we'll use the monthly sequence instead, using format="%m"
as the month. Dissimilar from POSIXlt
above (June is 5
), reading ?%strptime
for the %
-codes defines %m
as ‘%m’ Month as decimal number (01-12)
, so June is back to "06"
.
library(dplyr)
mydata %>%
rowwise() %>%
summarize(
id, statenumber,
dates = seq(startdate, enddate, by = "month"),
year = format(dates, format = "%Y")) %>%
group_by(id, statenumber, year) %>%
filter(any(format(dates, format = "%m") == "06")) %>%
distinct(id, statenumber, year) %>%
ungroup()
# # A tibble: 11 × 3
# id statenumber year
# <chr> <dbl> <chr>
# 1 R007 1 2008
# 2 R007 1 2009
# 3 R007 1 2010
# 4 R007 1 2011
# 5 R007 1 2012
# 6 R008 2 2008
# 7 R008 2 2009
# 8 R008 2 2010
# 9 R008 2 2011
# 10 R008 3 2012
# 11 R009 4 2012
I generally try to avoid rowwise
when able, but this problem does need to be executed one row at a time (which is effectively what Map
is doing in the base R solution above).
data.table
library(data.table)
as.data.table(mydata)[, .(id, statenumber, dates = seq(startdate, enddate, by = "month")), by = .(seq(nrow(mydata)))
][, year := format(dates, format="%Y")
][, .SD[any(format(dates, format="%m") == "06"),], by = .(id, statenumber, year)
][, c("seq", "dates") := NULL
][, unique(.SD)]
# id statenumber year
# <char> <num> <char>
# 1: R007 1 2008
# 2: R007 1 2009
# 3: R007 1 2010
# 4: R007 1 2011
# 5: R007 1 2012
# 6: R008 2 2008
# 7: R008 2 2009
# 8: R008 2 2010
# 9: R008 2 2011
# 10: R008 3 2012
# 11: R009 4 2012