
I have a variable that provides miscellaneous dates. I want to summarize these so they can be factored before being used in a predictive model.

I would like to group the dates as follows:

  • This Year (this calendar year)
  • Last Year
  • Over 3 Years Ago

I'm pretty new to R so any help on this would be much appreciated. Thank you

Ronak Shah
Sebastian Hubard
  • Please add data using `dput` or something that we can copy and use. Also show expected output for the same. Read about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and [how to give a reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Oct 27 '20 at 01:07
  • If your date column is `Date` or `POSIX` class (as it should be), you can use `cut` as you would with integers - [see this FAQ about binning data](https://stackoverflow.com/q/5570293/903061). – Gregor Thomas Oct 27 '20 at 02:49

1 Answer


As other commenters have noted, you haven't supplied any data or a reproducible example, but let's give this a go anyway.

I'll be using two tidyverse packages, dplyr and lubridate, to help us out.

For present purposes, let's start by generating some random dates and putting them into a dataframe/tibble. I'm assuming your dates are already in a dataframe and of the right class, as Gregor pointed out above.

library(dplyr)
library(lubridate)
data <- tibble(date = sample(seq(as.Date("2015-01-01"), as.Date("2020-12-31"), by = "day"), 50))

Let's now use dplyr and lubridate to recode the dates into a new variable, date_group:

data %>%
  mutate(date_group = factor(
    case_when(
      year(date) == year(today()) ~ "This Year",
      year(date) == year(today()) - 1 ~ "Last Year",
      date < today() - years(3) ~ "Over 3 Years Ago",
      TRUE ~ "Other"
    )
  ))

For the first two groups, we apply the lubridate function year() (which extracts the year from a date) to the date column in data, and compare the result against the year extracted from today's date (using today()).

For dates over 3 years ago, we subtract 3 years from today's date (noting that this is different from the calendar-year based calculations for this year and last year) using years().

Of course, this leaves a gap for dates less than 3 years ago but more than 1 calendar year ago. We have a default option in the case_when function to specify this as "Other".
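To make the gap visible, here is a small sketch using five hand-picked dates. The names `ref_date`, `example` and `result` are my own, and `ref_date` is a fixed stand-in for today() so that the output does not change depending on when you run it.

```r
library(dplyr)
library(lubridate)

# Assumed fixed reference date, used in place of today() for reproducibility
ref_date <- as.Date("2020-10-27")

# Five hand-picked dates, one or two per group
example <- tibble(
  date = as.Date(c("2020-03-15", "2019-06-01", "2018-05-20",
                   "2017-11-30", "2016-01-10"))
)

result <- example %>%
  mutate(date_group = case_when(
    year(date) == year(ref_date)     ~ "This Year",
    year(date) == year(ref_date) - 1 ~ "Last Year",
    date < ref_date - years(3)       ~ "Over 3 Years Ago",  # before 2017-10-27
    TRUE                             ~ "Other"
  ))

result
```

Here 2018-05-20 and 2017-11-30 both land in "Other": neither falls in this or last calendar year, and neither is earlier than 2017-10-27. Note that subtracting years(3) from 29 February of a leap year gives NA, so you may want to guard against that edge case.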

We wrap the result of the case_when function in factor() so that the resulting groups are treated as a factor rather than a string, ready for subsequent modelling.
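One thing to be aware of: factor() orders the levels alphabetically by default, and with R's default treatment contrasts the first level becomes the reference category in a model. If you care about the order, you can pass levels explicitly. A minimal sketch, where `labels`, `group_order`, `default_order` and `explicit_order` are illustrative names of my own:

```r
# Some group labels as they might come out of case_when()
labels <- c("Last Year", "Over 3 Years Ago", "This Year", "Other")

# Default: levels are sorted alphabetically
default_order <- factor(labels)

# Explicit: levels follow the order we choose (most recent first here)
group_order <- c("This Year", "Last Year", "Other", "Over 3 Years Ago")
explicit_order <- factor(labels, levels = group_order)

levels(default_order)
levels(explicit_order)
```

With the explicit version, "This Year" is the first level and so would act as the baseline under treatment contrasts.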

The case_when function is useful (and easy to read) when you have just a few categories. With many categories it gets messy, and you should think about another way to restructure your data.
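One such alternative, following Gregor's comment, is cut(). The sketch below is not equivalent to the code above: it assumes you are happy to bin purely by calendar-year difference, and the "2-3 Years Ago" label, the break points and the `ref_date` stand-in for today() are my own choices for illustration.

```r
library(lubridate)

# Assumed fixed reference date, used in place of today() for reproducibility
ref_date <- as.Date("2020-10-27")

dates <- as.Date(c("2020-03-15", "2019-06-01", "2018-05-20",
                   "2017-11-30", "2016-01-10"))

# Number of calendar years between each date and the reference date
years_back <- year(ref_date) - year(dates)

# Intervals are right-closed by default: (-Inf,0], (0,1], (1,3], (3,Inf]
date_group <- cut(
  years_back,
  breaks = c(-Inf, 0, 1, 3, Inf),
  labels = c("This Year", "Last Year", "2-3 Years Ago", "Over 3 Years Ago")
)

date_group
```

cut() returns a factor directly, with the levels in the order of the labels, so adding more bins only means extending the breaks and labels vectors.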

semaphorism