subsetting df with repeated sequences

Question

I have searched high and low for a solution to this, but I cannot find one.....

My dataframe (essentially a table of the no. 1 sports team by date) has numerous occasions where one or various teams would "reappear" in the data. I want to pull out the start (or end) date of each period at no. 1 per team.

An example of the data could be:

x1<- as.Date("2013-12-31")
adddate1 <- 1:length(teams1)
dates1 <- x1 + adddate1
teams2 <- c(rep("w", 3), rep("c", 8), rep("w", 4))
x2<- as.Date("2012-12-31")
adddate2 <- 1:length(teams2)
dates2 <- x2 + adddate2
dates <- c(dates2, dates1)
teams <- c(teams2, teams1)
df <- data.frame(dates, teams)
df$year <- year(df$dates)

which for 2013 looks like:

        dates teams year
1  2013-01-01     w 2013
2  2013-01-02     w 2013
3  2013-01-03     w 2013
4  2013-01-04     c 2013
5  2013-01-05     c 2013
6  2013-01-06     c 2013
7  2013-01-07     c 2013
8  2013-01-08     c 2013
9  2013-01-09     c 2013
10 2013-01-10     c 2013
11 2013-01-11     c 2013
12 2013-01-12     w 2013
13 2013-01-13     w 2013
14 2013-01-14     w 2013
15 2013-01-15     w 2013

However, using ddply aggregates the identically-named teams and returns the following:

split <- ddply(df, .(year, teams), head,1)
split <- split[order(split[,1]),]

       dates teams year
2 2013-01-01     w 2013
1 2013-01-04     c 2013
3 2014-01-01     c 2014
4 2014-01-09     k 2014

Is there a more elegant way to do this than creating a function which would go through the original df and return a unique value for each subset, add this to the df and then use ddply incorporating the new unique value to return what I want?

Does `aggregate(df$dates,by=df[,c("teams","year")],min)` work? — nicola, Nov 29 '14 at 13:16

score 4 · Accepted Answer · edited May 23 '17 at 12:12

You say some teams "reappear" and at that point I thought the little intergroup helper function from this answer might be just the right tool here. It is useful when in your case, there are teams e.g. "w" that reappear in the same year, e.g. 2013, after another team has been there for some time, e.g. "c". Now if you want to treat each sequence of occurence per team as separate groups in order to get the first or last date of that sequence, that when this function is useful. Note that if you only group by "team" and "year" as you would normally do, each team, e.g. "w" could only have one first/last date (for example when using "summarise" in dplyr).

Define the function:

intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

Now group your data first by year and then additionally by using the intergroup function on the teams column:

library(dplyr)
df %>%
  group_by(year) %>%
  group_by(teamindex = intergroup(teams), add = TRUE) %>%
  filter(dense_rank(dates) == 1)

Finally, you can filter according to your needs. Here for example, I filter the min dates. The result would be:

#Source: local data frame [3 x 4]
#Groups: year, teamindex
#
#       dates teams year teamindex
#1 2013-01-01     w 2013         1
#2 2013-01-04     c 2013         2
#3 2013-01-12     w 2013         3

Note that team "w" reappears because we grouped by "teamindex" which we created by using intergroup function.

Another option to do the filtering is like this (using arrange and then slice):

df %>%
  group_by(year) %>%
  group_by(teamindex = intergroup(teams), add = TRUE) %>%
  arrange(dates) %>%
  slice(1)

The data I used is from akrun's answer.

Thanks again for your help beginneR. The first option of your response works perfectly on my dataset. For the second option, what package contains the slice function ? I have dplyr and plyr loaded but it is not recognized....... — laoisman, Nov 29 '14 at 20:48
Very welcome, @laoisman. The slice function is also from dplyr, but perhaps you need to update to the latest version on CRAN. Should be 0.3.0.2. And take care of function name conflicts when you have plyr and dplyr loaded at the same time. — talat, Nov 29 '14 at 20:52

score 2 · Answer 2 · edited Jun 20 '20 at 09:12

You could also use rle to create the teamindex.

library(dplyr)
 df %>% 
    group_by(year) %>% 
    group_by(teamindex= with(rle(teams),
          rep(seq_along(lengths), lengths)), add=TRUE) %>%
          filter(dates==min(dates)) #or #filter(dates==max(dates))

 #        dates teams year teamindex
 #1 2013-01-01     w 2013         1
 #2 2013-01-04     c 2013         2
 #3 2013-01-12     w 2013         3

Or

df %>% 
   group_by(year) %>%
   group_by(teamindex= with(rle(teams),
      rep(seq_along(lengths), lengths)), add=TRUE) %>%
   arrange(dates) %>%
   slice(n()) #or #slice(1)
 #       dates teams year teamindex
 #1 2013-01-03     w 2013         1
 #2 2013-01-11     c 2013         2
 #3 2013-01-15     w 2013         3

data

df <- structure(list(dates = structure(c(15706, 15707, 15708, 15709, 
15710, 15711, 15712, 15713, 15714, 15715, 15716, 15717, 15718, 
15719, 15720), class = "Date"), teams = c("w", "w", "w", "c", 
"c", "c", "c", "c", "c", "c", "c", "w", "w", "w", "w"), year = c(2013L, 
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 
2013L, 2013L, 2013L, 2013L, 2013L)), .Names = c("dates", "teams", 
"year"), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", 
"9", "10", "11", "12", "13", "14", "15"), class = "data.frame")

Perhaps I hadn't explained my problem very clearly ;-). Both of these return similar output to the ddply call using my dataset. The intergroup helper function as proposed by beginneR seems to make the difference here. Thanks anyways. — laoisman, Nov 29 '14 at 21:37
@laoisman Yes, I was a bit confused by `Is there a more elegant way to do this`. I updated the answer using `rle`. — akrun, Nov 30 '14 at 04:22

subsetting df with repeated sequences

2 Answers2

data

Linked