
I have written the code below to take two data frames and interleave them by row, based on this example. I believe it uses Bresenham's line algorithm, which evenly disperses the rows of the shorter data frame within the longer one.

library(tidyverse) # for tibble(), add_row(), slice() and reduce()

interleave_rows <- function(x, y) {
  m <- nrow(x)
  
  yi <- 1             # next row of y to insert
  len <- m + nrow(y)  # total number of rows in the result
  err <- len %/% 2    # Bresenham-style error accumulator
  res <- x
  
  for (i in 1:len) {
    err <- err - m
    if (err < 0) { err <- err + len } else {
      # insert the next row of y before position i
      res <- add_row(res, !!! slice(y, yi), .before = i)
      yi <- yi + 1
    }
  }
  res
}

l <- list(
  a = tibble(n = 1:3, l = letters[1:3]),
  b = tibble(n = 4:9, l = letters[4:9]),
  c = tibble(n = 10:11, l = letters[10:11])
)

reduce(l, interleave_rows)

I'm using this in a Shiny app as part of a reduce, and it's a little slow. I also don't find it a very tidy or functional approach to the problem. I haven't been able to work out how to do this without the loop and the repeated reassignment, but I suspect doing so would be faster. Is there a better approach?

RaidPinata

2 Answers


I think the problem with your function is that it inserts rows into the data frame one at a time. It is better to compute the interleaving indices up front, call rbind once, and reorder the combined rows by those indices.

This function works by effectively calculating the quantile of the row number within each data frame, then finding the order of the quantiles:

interleave_rows <- function(df_a, df_b)
{
  # Make sure df_a is the longer data frame
  if(nrow(df_b) > nrow(df_a)) return(interleave_rows(df_b, df_a))
  a <- seq(nrow(df_a))
  b <- seq(nrow(df_b))
  
  # Give each row of df_b a fractional position among the rows of df_a,
  # then reorder the combined rows by those positions
  rbind(df_a, df_b)[order(c(a, length(a) * b/(length(b) + 1))), ]
}
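To make the index construction concrete, here is a small base-R sketch (my illustration, not part of the function itself) using the same sizes as the example below: 10 rows in df_a and 4 in df_b. Each row of the shorter frame gets a fractional position among the rows of the longer one, and order() converts those positions into row indices into the combined frame:

```r
a <- seq(10)                              # row numbers of the longer frame
b <- seq(4)                               # row numbers of the shorter frame
pos_b <- length(a) * b / (length(b) + 1)  # fractional positions: 2 4 6 8
order(c(a, pos_b))                        # row indices into rbind(df_a, df_b)
#> [1]  1  2 11  3  4 12  5  6 13  7  8 14  9 10
```

Ties (a whole-number position such as 2) are broken stably, so the row from df_a comes before the df_b row at the same position.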

You can see how this works clearly with these two data frames:

df_a <- data.frame(came_from = rep("A", 10), value = 1:10)
df_b <- data.frame(came_from = rep("B", 4),  value = 101:104)

interleave_rows(df_a, df_b)
#>    came_from value
#> 1          A     1
#> 2          A     2
#> 11         B   101
#> 3          A     3
#> 4          A     4
#> 12         B   102
#> 5          A     5
#> 6          A     6
#> 13         B   103
#> 7          A     7
#> 8          A     8
#> 14         B   104
#> 9          A     9
#> 10         A    10

And on your own data you would get:

l <- list(
  a = tibble(n = 1:3, l = letters[1:3]),
  b = tibble(n = 4:9, l = letters[4:9]),
  c = tibble(n = 10:11, l = letters[10:11])
)

reduce(l, interleave_rows)
#> # A tibble: 11 x 2
#>        n l    
#>    <int> <chr>
#>  1     4 d    
#>  2     1 a    
#>  3     5 e    
#>  4    10 j    
#>  5     6 f    
#>  6     2 b    
#>  7     7 g    
#>  8    11 k    
#>  9     3 c    
#> 10     8 h    
#> 11     9 i 

In terms of timing, even on small data frames this is over 10 times faster than the original. I suspect the difference would be more marked on larger data frames:

microbenchmark::microbenchmark(reduce(l, interleave_rows), 
                               reduce(l, interleave_rows_OP))
#> Unit: milliseconds
#>                           expr     min       lq      mean   median       uq     max neval
#>     reduce(l, interleave_rows)  2.6741  2.94680  3.610404  3.05115  3.22800 21.5097   100
#>  reduce(l, interleave_rows_OP) 36.2170 37.82645 40.005754 38.90145 40.03415 57.3965   100
Allan Cameron

From Allan Cameron's excellent answer I was able to do exactly what I wanted. I'm adding this answer just for reference, to show how to do the same thing with tidyverse functions.

interleave_rows_tidy <- function(df_a, df_b) {
  if(nrow(df_b) > nrow(df_a)) return(interleave_rows_tidy(df_b, df_a))
  a <- df_a %>% nrow %>% seq
  b <- df_b %>% nrow %>% seq
  
  bind_rows(df_a, df_b) %>% arrange(c(a, length(a) * b/(length(b) + 1)))
}

The key feature for me was how to calculate the sequence of positions. In case anyone is wondering, here are the microbenchmarks.

microbenchmark::microbenchmark(reduce(l, interleave_rows_tidy),
                               reduce(l, interleave_rows_SO))
Unit: microseconds
                            expr     min        lq     mean    median       uq      max neval
 reduce(l, interleave_rows_tidy) 852.904 1088.5170 2586.924 1742.8185 4013.212 7401.947   100
   reduce(l, interleave_rows_SO) 504.500  636.9975 1251.016  769.7465 1357.512 4738.728   100

It looks like the tidy version is a little slower on this test data. Both are much faster than my original loop, which added rows one by one.
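As a quick sanity check (a sketch that assumes dplyr, purrr and tibble are loaded, with interleave_rows_SO being Allan Cameron's rbind/order function as named in the benchmark above), the two versions should produce the same interleaving on the test data:

```r
all.equal(reduce(l, interleave_rows_SO),    # the rbind/order version
          reduce(l, interleave_rows_tidy))  # the bind_rows/arrange version
```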

RaidPinata