I have an irregular sequence of years.

Specifically, 2004 is followed by 2005, 2006 is missing, 2007 and 2008 are present, and then the sequence skips every year until 2014.

# data input
df_in <- 
  data.frame(seq = c(2004L, 2005L, 2007L, 2008L, 2014L, 2015L, 2016L))

# desired result
df_out <- 
  data.frame(df_in, grp = c(1L, 1L, 2L, 2L, 3L, 3L, 3L))

   seq grp
1 2004   1
2 2005   1
3 2007   2
4 2008   2
5 2014   3
6 2015   3
7 2016   3

I would like to assign a group number to each run of consecutive years. Group 1 would contain 2004 and 2005, group 2 would contain 2007 and 2008, and group 3 would contain 2014 to 2016.

Any help would be appreciated.

Miha Trošt

3 Answers

How about:

df_in$group = 1 + c(0, cumsum(ifelse(diff(df_in$seq) > 1, 1, 0)))

The idea here is that diff calculates the lagged difference between neighbouring years. Wherever that difference is more than 1 there is a gap, so ifelse marks it with a 1. cumsum then keeps a running count of the gaps encountered so far, and each gap starts a new group. The leading c(0, ...) is there because the output of diff is one element shorter than our data, and we need a value for the first element. Finally, the 1 + is purely cosmetic, so that the first group is 1 instead of 0.

> df_in$group 
[1] 1 1 2 2 3 3 3
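
To make the explanation above concrete, here is a minimal sketch that splits the one-liner into its intermediate steps. The names seq_years, gaps and is_break are only illustrative and do not appear in the answer itself.

```r
# Example years from the question.
seq_years <- c(2004L, 2005L, 2007L, 2008L, 2014L, 2015L, 2016L)

# Lagged differences between neighbouring years (one element shorter than the input).
gaps <- diff(seq_years)

# 1 wherever the difference exceeds 1, i.e. wherever a new group starts.
is_break <- ifelse(gaps > 1, 1, 0)

# Running count of breaks, padded with 0 for the first year, shifted to start at 1.
group <- 1 + c(0, cumsum(is_break))

print(gaps)      # 1 2 1 6 1 1
print(is_break)  # 0 1 0 1 0 0
print(group)     # 1 1 2 2 3 3 3
```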
Jon Spring
cumsum(c(1, diff(df_in$seq)) != 1) + 1
[1] 1 1 2 2 3 3 3
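
This one-liner prepends 1 to the differences so that the first year never counts as a break, then counts every difference that is not exactly 1. One thing to be aware of, shown in the sketch below on a made-up vector x with a repeated year: testing != 1 also treats a difference of 0 as a break, whereas testing > 1 does not. For the strictly increasing years in the question both variants give the same result.

```r
# Hypothetical input with a duplicated year (not part of the question's data).
x <- c(2004L, 2005L, 2005L, 2007L)

# A difference of 0 counts as a break here.
grp_neq <- cumsum(c(1, diff(x)) != 1) + 1

# Only differences larger than 1 count as a break here.
grp_gt <- cumsum(c(1, diff(x)) > 1) + 1

print(grp_neq)  # 1 1 2 3
print(grp_gt)   # 1 1 1 2
```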
s_baldur

This is the best I could come up with, but it would be great if someone else had a more elegant solution:

df_in <- data.frame(seq = c(2004L, 2005L, 2007L, 2008L, 2014L, 2015L, 2016L))

Define the maximum allowed distance between neighbouring elements within a group:

max_range_within_group <- 1

Calculate existing distances:

diffs <- df_in$seq[-1] - df_in$seq[-length(df_in$seq)]

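Iterate through the distances. If a distance is within the allowed range, the element stays in the current group; otherwise grp is increased by 1:

# The first element always belongs to group 1.
grp <- 1
# The loop variable is named d so that it does not shadow base::diff.
for (d in diffs) {
  nextGrp <- if (d <= max_range_within_group) {
    grp[length(grp)]       # same group as the previous element
  } else {
    grp[length(grp)] + 1   # gap too large: start a new group
  }
  grp <- c(grp, nextGrp)
}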

Add grp to the data.frame as a new column:

df_in$grp <- grp

This returns:

   seq grp
1 2004   1
2 2005   1
3 2007   2
4 2008   2
5 2014   3
6 2015   3
7 2016   3
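
The loop above can probably be replaced by a vectorised expression that keeps the configurable threshold. The following is a sketch under that assumption rather than a drop-in replacement; starts_new_group is an illustrative name that is not in the answer.

```r
df_in <- data.frame(seq = c(2004L, 2005L, 2007L, 2008L, 2014L, 2015L, 2016L))

# Maximum allowed distance between neighbouring elements within a group.
max_range_within_group <- 1

# TRUE for the first row and for every row whose distance to the
# previous row exceeds the threshold.
starts_new_group <- c(TRUE, diff(df_in$seq) > max_range_within_group)

# Summing the logical vector cumulatively numbers the groups from 1.
df_in$grp <- cumsum(starts_new_group)

print(df_in$grp)  # 1 1 2 2 3 3 3
```

With max_range_within_group set to 2, the years 2004 to 2008 would fall into a single group and 2014 to 2016 into a second one.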
dario