21

I have a trouble with repeating rows of my real data using dplyr. There is already another post in here repeat-rows-of-a-data-frame but no solution for dplyr.

Here I just wonder how could be the solution for dplyr but failed with error:

Error: wrong result size (16), expected 4 or 1

library(dplyr)
    df <- data.frame(column = letters[1:4])

    df_rep <- df%>%
      mutate(column=rep(column,each=4))

Expected output

>df_rep 
    column
    #a
    #a
    #a
    #a
    #b
    #b
    #b
    #b
    #*
    #*
    #*
Community
  • 1
  • 1
Alexander
  • 4,527
  • 5
  • 51
  • 98
  • 1
    The only way I can think of is to pipe into a `do` block and, from the current data.frame, generate a new one as you want here (`df %>% do(data.frame(column = rep(.$column, 4)))`). This is fraught with peril, though, if the data.frame has any other columns. – r2evans Jul 07 '16 at 04:36
  • @r2evans works great. You can sent it as an answer. just need to change `do(data.frame(a = rep(.$a, each=4)))`. – Alexander Jul 07 '16 at 04:41

4 Answers4

34

Using the uncount function will solve this problem as well. The column count indicates how often a row should be repeated.

library(tidyverse)

df <- tibble(letters = letters[1:4])

df 
# A tibble: 4 x 1
  letters
  <chr>  
1 a      
2 b      
3 c      
4 d 

df %>%
  mutate(count = c(2, 3, 2, 4)) %>%
  uncount(count)

# A tibble: 11 x 1
   letters
   <chr> 
 1 a      
 2 a      
 3 b      
 4 b      
 5 b      
 6 c      
 7 c      
 8 d      
 9 d      
10 d      
11 d  
MrNetherlands
  • 920
  • 7
  • 14
15

I was looking for a similar (but slightly different) solution. Posting here in case it's useful to anyone else.

In my case, I needed a more general solution that allows each letter to be repeated an arbitrary number of times. Here's what I came up with:

library(tidyverse)

df <- data.frame(letters = letters[1:4])
df

> df
  letters
1       a
2       b
3       c
4       d

Let's say I want 2 A's, 3 B's, 2 C's and 4 D's:

df %>% 
  mutate(count = c(2, 3, 2, 4)) %>% 
  group_by(letters) %>% 
  expand(count = seq(1:count))

# A tibble: 11 x 2
# Groups:   letters [4]
   letters count
    <fctr> <int>
 1       a     1
 2       a     2
 3       b     1
 4       b     2
 5       b     3
 6       c     1
 7       c     2
 8       d     1
 9       d     2
10       d     3
11       d     4

If you don't want to keep the count column:

df %>% 
  mutate(count = c(2, 3, 2, 4)) %>% 
  group_by(letters) %>% 
  expand(count = seq(1:count)) %>% 
  select(letters)

# A tibble: 11 x 1
# Groups:   letters [4]
   letters
    <fctr>
 1       a
 2       a
 3       b
 4       b
 5       b
 6       c
 7       c
 8       d
 9       d
10       d
11       d

If you want the count to reflect the number of times each letter is repeated:

df %>% 
  mutate(count = c(2, 3, 2, 4)) %>% 
  group_by(letters) %>% 
  expand(count = seq(1:count)) %>% 
  mutate(count = max(count))

# A tibble: 11 x 2
# Groups:   letters [4]
   letters count
    <fctr> <dbl>
 1       a     2
 2       a     2
 3       b     3
 4       b     3
 5       b     3
 6       c     2
 7       c     2
 8       d     4
 9       d     4
10       d     4
11       d     4
Brad Cannell
  • 3,020
  • 2
  • 23
  • 39
  • looked nice, but this was very slow in my experience – ozgeneral Mar 17 '18 at 22:51
  • Another way of doing this is using a `join` operation. You specify in a `data.frame` what information should be appended for which letter, and then join it `by` the column letters. `library(dplyr); df <- data.frame(LETTERS = LETTERS[1:4],letters = c(letters[1:2],letters[1:2])); rows_to_repeat <- bind_rows(data.frame(letters="a", counter=1:2),data.frame(letters="b", counter=1:3)); left_join(df, rows_to_repeat)`. Note that it repeated rows "C" and "D", according to the rules specified for "a" and "b". – fabern Sep 21 '18 at 13:34
10

This is rife with peril if the data.frame has other columns (there, I said it!), but the do block will allow you to generate a derived data.frame within a dplyr pipe (though, ceci n'est pas un pipe):

library(dplyr)
df <- data.frame(column = letters[1:4], stringsAsFactors = FALSE)
df %>%
  do( data.frame(column = rep(.$column, each = 4), stringsAsFactors = FALSE) )
#    column
# 1       a
# 2       a
# 3       a
# 4       a
# 5       b
# 6       b
# 7       b
# 8       b
# 9       c
# 10      c
# 11      c
# 12      c
# 13      d
# 14      d
# 15      d
# 16      d

As @Frank suggested, a much better alternative could be

df %>% slice(rep(1:n(), each=4))
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 41
    In my experience, `do` is very slow. You can `slice` like `df %>% slice(rep(1:n(), each=4))`. This also handles the case of more columns. – Frank Jul 07 '16 at 05:04
  • 2
    Nice alternative, and it's certainly more elegant. I was trying to come up with something like that but my brain kept rebelling. Thanks, Frank! (And I agree, `do` tends to bog things down, a known bottleneck.) – r2evans Jul 07 '16 at 05:06
  • Frank has the best and simplest solution here. – Jamie Tock Mar 23 '22 at 18:59
  • @JamieTock I agree. This was written many years ago, and both my proficiency and the package have changed since then. I don't make it a habit to go back and edit all past answers. Feel free to suggest an edit if you'd like – r2evans Mar 23 '22 at 21:04
3

I did a quick benchmark to show that uncount() is a lot faster than expand()

# for the pipe
library(magrittr)

# create some test data
df_test <- 
  tibble::tibble(
    letter = letters,
    row_count = sample(1:10, size = 26, replace = TRUE)
  )

# benchmark
bench <- microbenchmark::microbenchmark(
  expand = df_test %>%
    dplyr::group_by(letter) %>%
    tidyr::expand(row_count = seq(1:row_count)),
  uncount = df_test %>%
    tidyr::uncount(row_count)
)

# plot the benchmark
ggplot2::autoplot(bench)

Benchmark plot

Eddd
  • 69
  • 3