In R, split a dataframe so subset dataframes contain last row of previous dataframe and first row of subsequent dataframe

Question

There are many answers for how to split a dataframe, for example How to split a data frame?

However, I'd like to split a dataframe so that the smaller dataframes contain the last row of the previous dataframe and the first row of the following dataframe.

Here's an example

n <- 1:9
group <- rep(c("a","b","c"), each = 3)
data.frame(n = n, group)

  n  group
1 1     a
2 2     a
3 3     a
4 4     b
5 5     b
6 6     b
7 7     c
8 8     c
9 9     c

I'd like the output to look like:

 d1 <- data.frame(n = 1:4, group = c(rep("a",3),"b"))
 d2 <- data.frame(n = 3:7, group = c("a",rep("b",3),"c"))
 d3 <- data.frame(n = 6:9, group = c("b",rep("c",3)))
 d <- list(d1, d2, d3)
 d

[[1]]
  n group
1 1     a
2 2     a
3 3     a
4 4     b

[[2]]
  n group
1 3     a
2 4     b
3 5     b
4 6     b
5 7     c

[[3]]
  n group
1 6     b
2 7     c
3 8     c
4 9     c

What is an efficient way to accomplish this task?

G. Grothendieck · Accepted Answer · 2015-11-19T02:10:01.597

Suppose DF is the original data.frame, the one with columns n and group. Let n be the number of rows in DF. Now define a function extract which given a sequence of indexes ix enlarges it to include the one prior to the first and after the last and then returns those rows of DF. Now that we have defined extract, split the vector 1, ..., n by group and apply extract to each component of the split.

n <- nrow(DF)
extract <- function(ix) DF[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
lapply(split(seq_len(n), DF$group), extract)

$a
  n group
1 1     a
2 2     a
3 3     a
4 4     b

$b
  n group
3 3     a
4 4     b
5 5     b
6 6     b
7 7     c

$c
  n group
6 6     b
7 7     c
8 8     c
9 9     c

The named list is handy here, too. – Tedward Nov 18 '15 at 21:08 — Tedward, Nov 18 '15 at 21:08

Henrik · Answer 2 · 2015-11-19T09:30:11.937

Or why not try good'ol by, which "[a]ppl[ies] a Function to a Data Frame Split by Factors [INDICES]".

by(data = df, INDICES = df$group, function(x){
   id <- c(min(x$n) - 1, x$n, max(x$n) + 1)
   na.omit(df[id, ])
   })


# df$group: a
#   n group
# 1 1     a
# 2 2     a
# 3 3     a
# 4 4     b
# -------------------------------------------------------------------------------- 
#   df$group: b
# n group
# 3 3     a
# 4 4     b
# 5 5     b
# 6 6     b
# 7 7     c
# -------------------------------------------------------------------------------- 
#   df$group: c
#   n group
# 6 6     b
# 7 7     c
# 8 8     c
# 9 9     c

Although the print method of by creates a 'fancy' output, the (default) result is a list, with elements named by the levels of the grouping variable (just try str and names on the resulting object).

score 3 · Answer 3 · answered Nov 18 '15 at 20:52

I was going to comment under @cdetermans answer but its too late now. You can generalize his approach using data.table::shift (or dyplr::lag) in order to find the group indices and then run a simple lapply on the ranges, something like

library(data.table) # v1.9.6+ 
indx <- setDT(df)[, which(group != shift(group, fill = TRUE))]
lapply(Map(`:`, c(1L, indx - 1L), c(indx, nrow(df))), function(x) df[x,])
# [[1]]
#    n group
# 1: 1     a
# 2: 2     a
# 3: 3     a
# 4: 4     b
# 
# [[2]]
#    n group
# 1: 3     a
# 2: 4     b
# 3: 5     b
# 4: 6     b
# 5: 7     c
# 
# [[3]]
#    n group
# 1: 6     b
# 2: 7     c
# 3: 8     c
# 4: 9     c

mlegge · Answer 4 · 2015-11-18T21:51:05.313

Could be done with data.frame as well, but is there ever a reason not to use data.table? Also this has the option to be executed with parallelism.

library(data.table)
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
df <- data.table(n = n, group)
df[, `:=` (group = factor(df$group))]
df[, `:=` (group_i = seq_len(.N), group_N = .N), by = "group"]

library(doParallel)
groups <- unique(df$group)
foreach(i = seq(groups)) %do% {
  df[group == groups[i] | (as.integer(group) == i + 1 & group_i == 1) | (as.integer(group) == i - 1 & group_i == group_N), c("n", "group"), with = FALSE]  
}
[[1]]
   n group
1: 1     a
2: 2     a
3: 3     a
4: 4     b
[[2]]
   n group
1: 3     a
2: 4     b
3: 5     b
4: 6     b
5: 7     c
[[3]]
   n group
1: 6     b
2: 7     c
3: 8     c
4: 9     c

bramtayl · Answer 5 · 2015-11-18T22:42:21.470

0

Here is another dplyr way:

library(dplyr)

data = 
  data_frame(n = n, group) %>%
  group_by(group)

firsts = 
  data %>%
  slice(1) %>%
  ungroup %>%
  mutate(new_group = lag(group)) %>%
  slice(-1)

lasts = 
  data %>%
  slice(n()) %>%
  ungroup %>%
  mutate(new_group = lead(group)) %>%
  slice(-n())

bind_rows(firsts, data, lasts) %>%
  mutate(final_group = 
           ifelse(is.na(new_group),
                  group,
                  new_group) ) %>%
  arrange(final_group, n) %>%
  group_by(final_group)

edited Nov 18 '15 at 22:42

answered Nov 18 '15 at 22:27

bramtayl

4,004
2
11
18

Thanks for the dplyr way. It improves on the similar method I used to solve the question that inspired this one: http://stackoverflow.com/questions/33787624/in-r-is-it-possible-to-include-the-same-row-in-multiple-groups-or-is-there-oth. Also have you seen 'lead' and 'lag'? https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html – Tedward Nov 18 '15 at 22:37
Yeah, I always forget which packages have lag and lead. I've added them in above. – bramtayl Nov 18 '15 at 22:42

In R, split a dataframe so subset dataframes contain last row of previous dataframe and first row of subsequent dataframe

5 Answers5

Linked