Is there a better R way to expand a dataframe by a function on rows?

Question

Question:

Below works, but is there a better "R way" of achieving similar result? I am essentially trying to create / distribute groups into individual line items according to a user defined function (currently just using a loop).

Example:

df1 <- data.frame(group = c("A", "B", "C"), 
              volume = c(200L, 45L, 104L)
              )
print(df1)
#>   group volume
#> 1     A    200
#> 2     B     45
#> 3     C    104

I want the volume to be broken across multiple rows according to group so that the final result is a dataframe where the new volume (vol2 in the below) would add up to original volume above. In this example, I'm applying integer math with a divisor of 52, so my final result should be:

print(df3)
#>    group vol2
#> 1      A   52
#> 2      A   52
#> 3      A   52
#> 4      A   44
#> 21     B   45
#> 31     C   52
#> 32     C   52

This works

The code below DOES get me to the desired result shown above:

div <- 52L
df1$intgr <- df1$volume %/% div
df1$remainder <- df1$volume %% div

print(df1)
#>   group volume intgr remainder
#> 1     A    200     3        44
#> 2     B     45     0        45
#> 3     C    104     2         0

df2 <- data.frame()

for (r in 1:nrow(df1)){
  if(df1[r,"intgr"] > 0){
    for (k in 1:as.integer(df1[r,"intgr"])){
      df1[r,"vol2"] <- div
      df2 <- rbind(df2, df1[r,])
    }
  }
  if(df1[r,"remainder"]>0){
    df1[r, "vol2"] <- as.integer(df1[r, "remainder"])
    df2 <- rbind(df2, df1[r,])
  }
}

print(df2)
#>    group volume intgr remainder vol2
#> 1      A    200     3        44   52
#> 2      A    200     3        44   52
#> 3      A    200     3        44   52
#> 4      A    200     3        44   44
#> 21     B     45     0        45   45
#> 31     C    104     2         0   52
#> 32     C    104     2         0   52

df3 <- subset(df2, select = c("group", "vol2"))

print(df3)
#>    group vol2
#> 1      A   52
#> 2      A   52
#> 3      A   52
#> 4      A   44
#> 21     B   45
#> 31     C   52
#> 32     C   52

Being still relatively new to R, I'm just curious if someone knows a better way / function / method that gets to the same place. Seems like there might be. I could potentially have a more complex way of breaking up the rows and I was thinking maybe there's a method that applies a UDF to the dataframe to do something like this. I was searching for "expand group/groups" but was finding mostly "expand.grid" which isn't what I'm doing here.

Thank you for any suggestions!

Why does the 4th row of `vol2` column show 44, not 52 as its first three rows? — Abdur Rohman, Dec 01 '22 at 21:31
@AbdurRohman - `200 %% 52` is `44` - it's the remainder after breaking out the `200` into rows of `52` each. — thelatemail, Dec 01 '22 at 21:33

r2evans · Accepted Answer · 2022-12-02T02:16:54.120

8

A quick function to help split each number by the modulus,

fun <- function(num, mod) c(rep(mod, floor(num / mod)), (num-1) %% mod + 1)
fun(200, 52)
# [1] 52 52 52 44
fun(45, 52)
# [1] 45
fun(104, 52)
# [1] 52 52

And we can apply this a number of ways:

dplyr

library(dplyr)
df1 %>%
  group_by(group) %>%
  summarize(vol2 = fun(volume, 52), .groups = "drop")
# # A tibble: 7 x 2
#   group  vol2
#   <chr> <dbl>
# 1 A        52
# 2 A        52
# 3 A        52
# 4 A        44
# 5 B        45
# 6 C        52
# 7 C        52

base R

do.call(rbind, by(df1, seq(nrow(df1)),
                  FUN = function(z) data.frame(group = z$group, vol2 = fun(z$volume, 52))))

data.table

library(data.table)
setDT(df1)
df1[, .(vol2 = fun(volume, 52)), by = group]

edited Dec 02 '22 at 02:16

answered Dec 01 '22 at 21:47

r2evans

141,215
6
77
149

1

Really nice and simple. – stefan Dec 01 '22 at 21:56
1

Thanks @stefan ... though I think the base R code shows why we need something like dplyr's and data.table's code to easily calculate by-group – r2evans Dec 01 '22 at 21:57
Mind==>BLOWN! You are using `group_by()/summarize()` to **EXPAND** a dataframe? I had no idea it would do that, as I have only experienced `summarize` for collapsing a dataframe. Still wrapping my head around that. Appreciate the multiple solutions, and I'm reading up on data.table as I'm unfamiliar. I love the function, but question: Beyond personal preference, is there a reason you used `ceiling(num / mod)-1` instead of slightly shorter (and more obvious to me) `floor(num / mod)`? Thanks a bunch! – ScottyJ Dec 02 '22 at 02:09
No reason for ceiling over floor, there were some quick iterations on possible solutions, and when I tried that it worked and I didn't look back to reduce it to `floor`. – r2evans Dec 02 '22 at 02:16
TBH, I don't recall exactly when `summarize` started allowing more than 1 row per group; if you look at the newest blog post about the [`dplyr-1.1.0` impending release](https://www.tidyverse.org/blog/2022/11/dplyr-1-1-0-is-coming-soon/), you'll see reference to `reframe` as a way to visualize `summarize` but allowing it to return more than just 1 row. Perhaps they've been silently testing the premise, I don't know. But it's been working for me, at least with 1.0.8, not sure about earlier. – r2evans Dec 02 '22 at 02:19
I like `data.table`, and for some projects and some people, I strongly recommend it. There is a learning curve, but when you get past the hump of that curve, it starts paying in some nice dividends, mostly when your data gets to be a bit big. Having said that ... the way `dplyr` can operate lazily on some things like SQL databases and arrow/parquet dataset connections is very powerful. They can do similar (lazy) things using `dtplyr` (with which I have no experience). – r2evans Dec 02 '22 at 02:22
1

Yes, I did. USNA '97 grad. Seven years active as a sub nuke (USS ASHEVILLE 758) and got out in 2004. Good people, good friends. I expect `flying > submerging` -- except when surface OOD in the morning off Pearl Harbor ;-). Carry on, sir! – ScottyJ Dec 02 '22 at 02:51
Not sure about `>`, I lean more heavily towards `!=`. Though ... subs don't get much of a ground-rush from speeding, not visually at least. – r2evans Dec 02 '22 at 12:24
1

Reading that [dplyr-1.1.0 impending release](https://www.tidyverse.org/blog/2022/11/dplyr-1-1-0-is-coming-soon/) a bit more carefully, it looks like the `group_by()/summarize()` solution will **NOT** work once that release occurs, but that `reframe()` function is specifically being created to do that. I am not going to install the test version to check it now, but once that comes out, I think that your answer may benefit from an update. – ScottyJ Dec 02 '22 at 18:31

score 3 · Answer 2 · answered Dec 01 '22 at 21:44

A tidyverse approach using purrr::pmap and tidyr::unnest_longer may look like so:

library(dplyr, w = FALSE)
library(tidyr)
library(purrr)

div <- 52

df1 |> 
  mutate(intgr = volume %/% div, remainder = volume %% div, intgr1 = +(remainder > 0)) |> 
  mutate(vol2 = purrr::pmap(list(intgr, intgr1, remainder), ~ c(rep(div, ..1), rep(..3, ..2)))) |> 
  tidyr::unnest_longer(vol2) |> 
  select(-intgr1)
#> # A tibble: 7 × 5
#>   group volume intgr remainder  vol2
#>   <chr>  <int> <dbl>     <dbl> <dbl>
#> 1 A        200     3        44    52
#> 2 A        200     3        44    52
#> 3 A        200     3        44    52
#> 4 A        200     3        44    44
#> 5 B         45     0        45    45
#> 6 C        104     2         0    52
#> 7 C        104     2         0    52

This is a great answer, too. I was thinking something along the lines of `pivot_longer` was what I would want, but couldn't see a way to use it, and the `unnest_longer()` was unknown to me until now (I have so much to learn!). Your use of `+(remainder > 0)` is a short cut instead of `as.integer()` to coerce the logical? Why does that work? It's shorter, but I haven't seen that before. Thank you very much for this answer! — ScottyJ, Dec 02 '22 at 03:02

jblood94 · Answer 3 · 2022-12-02T15:35:02.087

3

With data.table and rep:

library(data.table)

setDT(df1)[, .(vol2 = c(rep(52, volume%/%52), (volume%%52)[sign(volume%%52)])), group]
#>    group vol2
#> 1:     A   52
#> 2:     A   52
#> 3:     A   52
#> 4:     A   44
#> 5:     B   45
#> 6:     C   52
#> 7:     C   52

Or

setDT(df1)[, .(vol2 = c(rep(52, volume%/%52), volume%%52)), group][vol2 != 0]
#>    group vol2
#> 1:     A   52
#> 2:     A   52
#> 3:     A   52
#> 4:     A   44
#> 5:     B   45
#> 6:     C   52
#> 7:     C   52

edited Dec 02 '22 at 15:35

answered Dec 01 '22 at 21:46

jblood94

10,340
1
10
15

I wasn't expecting a one-liner answer/solution. Wow. – ScottyJ Dec 02 '22 at 04:05
What does the extra empty square brackets `[ ]` do at the end of your solutions? I get the same result without them. – ScottyJ Dec 02 '22 at 14:11
That's just to print it to the console. Just to show the result. You'll most likely take them off in your code. – jblood94 Dec 02 '22 at 14:31
Ah. Okay. It prints to my RStudio console both ways, whether I source with echo or not (what am I missing?). I really like the idea of using `sign(volume%%52)` to add the remainder if it exists. That trick seems particularly useful for all sorts of things. I'm going to pick @r2evans' answer as the solution (this is similar to his 3rd solution, though more condensed) because of the variety of approaches, but this wins for brevity and deserves respect. Thank you! – ScottyJ Dec 02 '22 at 14:48
You're right. It's not doing a by-reference assignment with the grouping operation. It should print either way. – jblood94 Dec 02 '22 at 15:23

thelatemail · Answer 4 · 2022-12-01T22:20:07.727

2

Vectorised and without grouping:

df1 <- data.frame(group = c("A", "B", "C"),
                  volume = c(200L, 45L, 104L))

n <- 52
idx <- df1$volume %/% n + ((sel <- df1$volume %% n) != 0)
out <- df1[rep(seq_len(nrow(df1)), idx),]
out$volume <- n
out$volume[cumsum(idx)[sel != 0]] <- sel[sel != 0]
##    group volume
##1       A     52
##1.1     A     52
##1.2     A     52
##1.3     A     44
##2       B     45
##3       C     52
##3.1     C     52

edited Dec 01 '22 at 22:20

answered Dec 01 '22 at 22:14

thelatemail

91,185
12
128
188

This is interesting. It's a bit hard to follow until I break out each piece, but it certainly beats my code. I think the assignment within an assignment is new to me: `sel <- df1$volume %%n`. I can follow @stefan and @r2evans solutions (and probably apply them to new situations) easier, but this works and makes me think. Thank you! – ScottyJ Dec 02 '22 at 03:38

score 1 · Answer 5 · answered Dec 04 '22 at 02:06

Another base R solution using aggregate :

aggregate(.~group,df1,\(x) c(rep(52, x / 52), (x-1) %% 52 + 1))

  group         volume
1     A 52, 52, 52, 44
2     B             45
3     C     52, 52, 52

This results in a list column for volume (could be useful)

To transform it to a long dataframe we can either use stack:


with(
  aggregate(.~group,df1,\(x) c(rep(52, x / 52), (x-1) %% 52 + 1)),
  setNames(stack(setNames(volume,group))[2:1],names(df1))
  )

  group volume
1     A     52
2     A     52
3     A     52
4     A     44
5     B     45
6     C     52
7     C     52
8     C     52

Or alternatively use unnest from tidyr

library(tidyr)

aggregate(.~group,df1,\(x) c(rep(52, x / 52), (x-1) %% 52 + 1)) %>% unnest(volume)

# A tibble: 8 × 2
  group volume
  <chr>  <dbl>
1 A         52
2 A         52
3 A         52
4 A         44
5 B         45
6 C         52
7 C         52
8 C         52