1

I have a dataframe with subjects and years, for each year there is only one subject. I want to create timespanes from the categories when the dates are continuous:

cat   <- c("Cat1","Cat1","Cat2","Cat2","Cat2","Cat3","Cat2","Cat2","Cat2")
year <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
df <- data.frame(Cat=cat, Year=year)
#  Which looks like the following:
# Cat1 2010
# Cat1 2011
# Cat2 2012
# Cat2 2013
# Cat2 2014
# Cat3 2015
# Cat2 2016
# Cat2 2017
# Cat2 2018

What I want as the output is a dataframe like this:

cat   <- c("Cat1","Cat2","Cat3","Cat2")
year <- c(2010,2012,2015,2016)
e_year <- c(2011,2014,2015,2018)
df_goal <- data.frame(Cat=cat, Year=year, EYear = e_year)
#  Cat Year EYear
# Cat1 2010  2011
# Cat2 2012  2014
# Cat3 2015  2015
# Cat2 2016  2018

I thought of doing it with a loop but I dont think thats the proper way to do it in R. So I wanted to ask before I spend time on that solution.

r2evans
  • 141,215
  • 6
  • 77
  • 149

2 Answers2

2

This is strongly related to both Calculate the mean by group (summarize by group) and Get first and last value from groups using rle, though slightly different from each.

base R

out <- aggregate(Year ~ Cat + grp, data = df, FUN = range)
out <- do.call(cbind.data.frame, out[,-2])
names(out)[2:3] <- c("Year", "EYear")
out
#    Cat Year EYear
# 1 Cat1 2010  2011
# 2 Cat2 2012  2014
# 3 Cat3 2015  2015
# 4 Cat2 2016  2018

dplyr

library(dplyr)
df %>%
  group_by(grp = cumsum(Cat != lag(Cat, default = ""))) %>%
  summarize(Cat = Cat[1], EYear = max(Year), Year = min(Year)) %>%
  ungroup() %>%
  select(-grp)
# # A tibble: 4 x 3
#   Cat   EYear  Year
#   <chr> <dbl> <dbl>
# 1 Cat1   2011  2010
# 2 Cat2   2014  2012
# 3 Cat3   2015  2015
# 4 Cat2   2018  2016

data.table

library(data.table)
as.data.table(df)[, .(Cat = Cat[1], EYear = max(Year), Year = min(Year)), by = .(grp = rleid(Cat))
  ][, grp := NULL]
#       Cat EYear  Year
#    <char> <num> <num>
# 1:   Cat1  2011  2010
# 2:   Cat2  2014  2012
# 3:   Cat3  2015  2015
# 4:   Cat2  2018  2016
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Excellent answers ;) Just one question: Wouldn't we want to `arrange` first by Year, since all solutions rely on this ordering? – dario Oct 15 '21 at 11:54
  • 2
    That depends on the OP: I'm interpreting the order of `Cat` to be paramount, and from there it's just min/max of the year. If instead they intend first/last, then we'll need to reconsider. (Further, the mention of "continuous" suggests we'll need more logic in the "rle" logic, though the data doesn't reflect a need for that.) – r2evans Oct 15 '21 at 11:56
  • 1
    Your answer was exactly what i was looking for, thanks alot. My data was ordered by year already. Continuous probably wasnt the best wording just meant that the same `Cat` value is there over the span of multiple rows/years. – user16543787 Oct 15 '21 at 13:42
1

With dplyr:

df %>% group_by(Cat, N=cumsum(Cat != lag(Cat, default=""))) %>%
  summarize(SYear=min(Year), EYear=max(Year)) %>%
  arrange(N) %>% select(-N)

Output:

  Cat   SYear EYear
  <chr> <dbl> <dbl>
1 Cat1   2010  2011
2 Cat2   2012  2014
3 Cat3   2015  2015
4 Cat2   2016  2018
U13-Forward
  • 69,221
  • 14
  • 89
  • 114