Cleaning data in R by using a reference date

Question

Consider the following dataset

mydata<-data.frame(id = c("R007", "R008", "R008", "R009", "R009"),
                   statenumber= c(1, 2, 3, 4, 5),
                   startdate = c(20080101, 20080101, 20120301,20120101, 20121001),
                   enddate = c(20121201, 20120301, 20121201, 20121001, 20121201))

#if necessary we can write the dates as dates
mydata$startdate<-as.Date(as.character(mydata$startdate), "%Y%m%d")
mydata$enddate<-as.Date(as.character(mydata$enddate), "%Y%m%d")

The explanation of the dataset is as follows. A person with id number R007 lives during 2008 and 2012 at the same address. Person R008 lives during 2008 and 2012 at two addresses (state number). So for the years 2008-2011, he lives at address 2, and for the year 2012 he lives at address 3. Person R009 is only available in the file since 2012. During the first year of 2012 he lives at address 4 and in the last two months he lives at address 5.

Now I want to rearrange this dataset such that I have one address line per person per year. I would like to do this by making use of a reference month (say June). In other words, if a person lives in June at a particular address, I assign that address to him for the entire year. The dataset would then look as follows

endresult<-data.frame(id = c("R007", "R007","R007","R007","R007",
                             "R008", "R008", "R008", "R008","R008", "R009"),
                      statenumber = c(1,1,1,1,1,2,2,2,2,3,4),
                      year = c(2008, 2009, 2010, 2011, 2012,
                               2008,2009,2010,2011,2012, 2012))

Does anyone know how to get to this endresult dataset? I appreciate any help.

If ID R009 doesn't get assigned the year 2013, should ID R007 get assigned the year 2008 as they have records only for the last 3 months of 2008? — NicChr, Jun 02 '23 at 11:00
Sorry @NicChr, I made a typo in the first two rows. The startdate should be 2008-01-01 and not 2008-10-01. I corrected the mistake in th above question. Sorry for the inconvenience. — Albert, Jun 02 '23 at 11:54

r2evans · Accepted Answer · 2023-06-02T12:32:05.697

Two basic approaches: (1) do a yearly sequence, conditioning on whether the start is before and if the end is after June (06), used in the base R approach; (2) do a monthly sequence, conditioning (group-filtering) each year on whether "06" is in one of the months. They both work, which you choose depends on comfort/preference.

base R

Since this example uses "sequence by year", I'll use the fact that POSIXlt (notice the l, as opposed to the more common POSIXct) is really a list with elements for year, month, etc:

dput(as.POSIXlt(Sys.time()))
# structure(list(sec = 21.7977600097656, min = 30L, hour = 8L, 
#     mday = 2L, mon = 5L, year = 123L, wday = 5L, yday = 152L, 
#     isdst = 1L, zone = "EDT", gmtoff = -14400L), class = c("POSIXlt", 
# "POSIXt"), tzone = c("", "EST", "EDT"))

With that, we can easily $-index the list and compare the month as a value. NOTE: ?POSIXlt shows that mon is ‘mon’ 0-11: months after the first of the year, meaning that June is 5 in 0-based months. (Ergo the use of 5 below.)

fun <- function(sd, ed) {
  sdlt <- as.POSIXlt(sd)
  edlt <- as.POSIXlt(ed)
  if (sdlt$mon > 5) sdlt$year <- sdlt$year+1
  if (edlt$mon < 5) edlt$year <- edlt$year-1
  if (sdlt <= edlt) unique(format(seq.Date(as.Date(sdlt), as.Date(edlt), by = "year"), format="%Y"))
}
years <- Map(fun, mydata$startdate, mydata$enddate)
str(years)
# List of 5
#  $ : chr [1:5] "2008" "2009" "2010" "2011" ...
#  $ : chr [1:4] "2008" "2009" "2010" "2011"
#  $ : chr "2012"
#  $ : chr "2012"
#  $ : NULL
out <- data.frame(
  id = rep(mydata$id, lengths(years)), 
  statenumber = rep(mydata$statenumber, lengths(years)), 
  year = unlist(years))
out
#      id statenumber year
# 1  R007           1 2008
# 2  R007           1 2009
# 3  R007           1 2010
# 4  R007           1 2011
# 5  R007           1 2012
# 6  R008           2 2008
# 7  R008           2 2009
# 8  R008           2 2010
# 9  R008           2 2011
# 10 R008           3 2012
# 11 R009           4 2012

If there's a chance to have duplicates (repeated states for an id that gap within the same year), then you can use unique(out).

Map just "zips" the data together. The individual calls to fun unrolled would look like this:

list(
  fun(mydata$startdate[1], mydata$enddate[1]),
  fun(mydata$startdate[2], mydata$enddate[2]),
  fun(mydata$startdate[3], mydata$enddate[3]),
  fun(mydata$startdate[4], mydata$enddate[4]),
  fun(mydata$startdate[5], mydata$enddate[5])
)

The inclusion of ed on the end of c(seq.Date(..), ed) is to guard against the fact that seq may not include the year of the enddate. In this case, it ensures that R008 in state 2 sees 2012.

dplyr

In this (and the data.table) section, we'll use the monthly sequence instead, using format="%m" as the month. Dissimilar from POSIXlt above (June is 5), reading ?%strptime for the %-codes defines %m as ‘%m’ Month as decimal number (01-12), so June is back to "06".

library(dplyr)
mydata %>%
  rowwise() %>%
  summarize(
    id, statenumber, 
    dates = seq(startdate, enddate, by = "month"), 
    year = format(dates, format = "%Y")) %>%
  group_by(id, statenumber, year) %>%
  filter(any(format(dates, format = "%m") == "06")) %>%
  distinct(id, statenumber, year) %>%
  ungroup()
# # A tibble: 11 × 3
#    id    statenumber year 
#    <chr>       <dbl> <chr>
#  1 R007            1 2008 
#  2 R007            1 2009 
#  3 R007            1 2010 
#  4 R007            1 2011 
#  5 R007            1 2012 
#  6 R008            2 2008 
#  7 R008            2 2009 
#  8 R008            2 2010 
#  9 R008            2 2011 
# 10 R008            3 2012 
# 11 R009            4 2012

I generally try to avoid rowwise when able, but this problem does need to be executed one row at a time (which is effectively what Map is doing in the base R solution above).

data.table

library(data.table)
as.data.table(mydata)[, .(id, statenumber, dates = seq(startdate, enddate, by = "month")), by = .(seq(nrow(mydata)))
  ][, year := format(dates, format="%Y")
  ][, .SD[any(format(dates, format="%m") == "06"),], by = .(id, statenumber, year)
  ][, c("seq", "dates") := NULL
  ][, unique(.SD)]
#         id statenumber   year
#     <char>       <num> <char>
#  1:   R007           1   2008
#  2:   R007           1   2009
#  3:   R007           1   2010
#  4:   R007           1   2011
#  5:   R007           1   2012
#  6:   R008           2   2008
#  7:   R008           2   2009
#  8:   R008           2   2010
#  9:   R008           2   2011
# 10:   R008           3   2012
# 11:   R009           4   2012

Thank you for your answer. I indeed do not want to have duplicates, but your solution does not solve my problem entirely. As inidicated in the question, I would like to have the month June to be the leading indicator for the statenumber (or address). So this would mean that individual R008 should only have the statenumber 3 and individual R009 only the statenumber 4 (as both live in june of a particular year at that address). — Albert, Jun 02 '23 at 12:06
I have one more question. When I perform your code on my real dataset I get the following error: "error in seq. data. from must be of length 1. Any idea what this error is and how I can solve it. I tried looking here, but I do not see how it apllies to this code https://stackoverflow.com/questions/46171483/how-should-i-deal-with-from-must-be-of-length-1-error — Albert, Jun 09 '23 at 13:32
I can trigger that error only when `startdate` is length-0, which seems unlikely in a dplyr pipe (or `Map` or `data.table`). No other ideas. — r2evans, Jun 09 '23 at 13:42

score 1 · Answer 2 · answered Jun 02 '23 at 12:13

Perhaps not the prettiest solution but would this work?

library(tidyverse)
library(tidyr)
mydata %>%
  mutate(yearmonth = startdate) %>%
  group_by(id, statenumber) %>%
  complete(yearmonth = seq(startdate, enddate, by = "month")) %>%
  arrange(id, yearmonth) %>%
  mutate(year = year(yearmonth),
         month = month(yearmonth)) %>%
  group_by(id) %>%
  fill(statenumber, .direction = "down") %>%
  mutate(year_residence = if_else(month == 6L, year, NA_integer_)) %>%
  group_by(id, year, statenumber) %>%
  fill(year_residence, .direction = "updown") %>%
  ungroup() %>%
  distinct(id, statenumber, year_residence) %>%
  filter(!is.na(year_residence))

Cleaning data in R by using a reference date

2 Answers2

base R

dplyr

data.table