202
df <- data.frame(var1 = c('a', 'b', 'c'), var2 = c('d', 'e', 'f'),
                 freq = 1:3)

What is the simplest way to expand each row the first two columns of the data.frame above, so that each row is repeated the number of times specified in the column 'freq'?

In other words, go from this:

df
  var1 var2 freq
1    a    d    1
2    b    e    2
3    c    f    3

To this:

df.expanded
  var1 var2
1    a    d
2    b    e
3    b    e
4    c    f
5    c    f
6    c    f
Henrik
  • 65,555
  • 14
  • 143
  • 159
wkmor1
  • 7,226
  • 3
  • 31
  • 23

10 Answers10

201

Here's one solution:

df.expanded <- df[rep(row.names(df), df$freq), 1:2]

Result:

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f
neilfws
  • 32,751
  • 5
  • 50
  • 63
  • 2
    Great! I always forget you can use square brackets that way. I keep thinking of indexing just for subsetting or reordering. I had another solution that is far less elegant and no doubt less efficient. I might post anyway so that others can compare. – wkmor1 May 24 '10 at 10:30
  • 27
    For large `data.frame` more efficient is to replace `row.names(df)` with `seq.int(1,nrow(df))` or `seq_len(nrow(df))`. – Marek May 25 '10 at 11:54
  • This worked fantastically for a big data frame -- 1.5million rows, 5 cols, went very quick. Thanks! – gabe Nov 21 '12 at 06:16
  • 6
    1:2 hard codes the solution to this example, 1:ncol(df) will work for an arbitrary dataframe. – vladiim Aug 30 '18 at 06:28
113

old question, new verb in tidyverse:

library(tidyr) # version >= 0.8.0
df <- data.frame(var1=c('a', 'b', 'c'), var2=c('d', 'e', 'f'), freq=1:3)
df %>% 
  uncount(freq)

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f
Sam Firke
  • 21,571
  • 9
  • 87
  • 105
einar
  • 2,575
  • 1
  • 16
  • 14
48

Use expandRows() from the splitstackshape package:

library(splitstackshape)
expandRows(df, "freq")

Simple syntax, very fast, works on data.frame or data.table.

Result:

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f
Sotos
  • 51,121
  • 6
  • 32
  • 66
Sam Firke
  • 21,571
  • 9
  • 87
  • 105
27

@neilfws's solution works great for data.frames, but not for data.tables since they lack the row.names property. This approach works for both:

df.expanded <- df[rep(seq(nrow(df)), df$freq), 1:2]

The code for data.table is a tad cleaner:

# convert to data.table by reference
setDT(df)
df.expanded <- df[rep(seq(.N), freq), !"freq"]
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Max Ghenis
  • 14,783
  • 16
  • 84
  • 132
8

Another dplyr alternative with slice where we repeat each row number freq times

library(dplyr)

df %>%  
  slice(rep(seq_len(n()), freq)) %>% 
  select(-freq)

#  var1 var2
#1    a    d
#2    b    e
#3    b    e
#4    c    f
#5    c    f
#6    c    f

seq_len(n()) part can be replaced with any of the following.

df %>% slice(rep(1:nrow(df), freq)) %>% select(-freq)
#Or
df %>% slice(rep(row_number(), freq)) %>% select(-freq)
#Or
df %>% slice(rep(seq_len(nrow(.)), freq)) %>% select(-freq)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
7

I know this is not the case but if you need to keep the original freq column, you can use another tidyverse approach together with rep:

library(purrr)

df <- data.frame(var1 = c('a', 'b', 'c'), var2 = c('d', 'e', 'f'), freq = 1:3)

df %>% 
  map_df(., rep, .$freq)
#> # A tibble: 6 x 3
#>   var1  var2   freq
#>   <fct> <fct> <int>
#> 1 a     d         1
#> 2 b     e         2
#> 3 b     e         2
#> 4 c     f         3
#> 5 c     f         3
#> 6 c     f         3

Created on 2019-12-21 by the reprex package (v0.3.0)

rdornas
  • 630
  • 7
  • 15
6

In case you have to do this operation on very large data.frames I would recommend converting it into a data.table and use the following, which should run much faster:

library(data.table)
dt <- data.table(df)
dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")]
dt.expanded[ ,freq := NULL]
dt.expanded

See how much faster this solution is:

df <- data.frame(var1=1:2e3, var2=1:2e3, freq=1:2e3)
system.time(df.exp <- df[rep(row.names(df), df$freq), 1:2])
##    user  system elapsed 
##    4.57    0.00    4.56
dt <- data.table(df)
system.time(dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")])
##    user  system elapsed 
##    0.05    0.01    0.06
vonjd
  • 4,202
  • 3
  • 44
  • 68
  • 1
    I get an error: `Error in rep(1, freq) : invalid 'times' argument`. And given that there is already a data.table answer to this question, you may want to describe how your approach is different or when it is better than the current data.table answer. Or if there's not a major difference, you could add it as a comment to the existing answer instead. – Sam Firke Jul 07 '15 at 16:20
  • @SamFirke: Thank you for your comment. Strange, I just tried it again and I get no such error. Do you use the original `df`from the OP's question? My answer is better because the other answer is kind of misusing the `data.table` package by using `data.frame` syntax, see the FAQ of `data.table`: " It is generally bad practice to refer to columns by number rather than name." – vonjd Jul 07 '15 at 16:37
  • 1
    Thanks for the explanation. Your code works for me on the sample `df` posted by the OP, but when I tried to benchmark this on a larger data.frame I got that error. The data.frame I used was: `set.seed(1) dfbig <- data.frame(var1=sample(letters, 1000, replace = TRUE), var2=sample(LETTERS, 1000, replace = TRUE), freq=sample(1:10, 1000, replace = TRUE))` On the tiny data.frame, the base answer does well in my benchmarking, it just doesn't scale well to bigger data.frames. The other three answers ran successfully with this larger data.frame. – Sam Firke Jul 07 '15 at 17:06
  • @SamFirke: This is indeed strange, it should work there too and I don't know why it doesn't. Do you want to create a question out of it or shall I? – vonjd Jul 07 '15 at 17:18
  • Good idea. Can you? I don't know `data.table` syntax so I shouldn't be the one judging the answers. – Sam Firke Jul 07 '15 at 17:22
  • @SamFirke: Did it, so let's see what comes in... http://stackoverflow.com/questions/31276157/strange-error-when-expanding-data-table – vonjd Jul 07 '15 at 18:06
  • @SamFirke: We found the problem, it was quite a ride: http://stackoverflow.com/questions/31276157/strange-error-when-expanding-data-table – vonjd Jul 07 '15 at 19:17
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/82644/discussion-between-sam-firke-and-vonjd). – Sam Firke Jul 07 '15 at 19:42
4

Another possibility is using tidyr::expand:

library(dplyr)
library(tidyr)

df %>% group_by_at(vars(-freq)) %>% expand(temp = 1:freq) %>% select(-temp)
#> # A tibble: 6 x 2
#> # Groups:   var1, var2 [3]
#>   var1  var2 
#>   <fct> <fct>
#> 1 a     d    
#> 2 b     e    
#> 3 b     e    
#> 4 c     f    
#> 5 c     f    
#> 6 c     f

One-liner version of vonjd's answer:

library(data.table)

setDT(df)[ ,list(freq=rep(1,freq)),by=c("var1","var2")][ ,freq := NULL][]
#>    var1 var2
#> 1:    a    d
#> 2:    b    e
#> 3:    b    e
#> 4:    c    f
#> 5:    c    f
#> 6:    c    f

Created on 2019-05-21 by the reprex package (v0.2.1)

M--
  • 25,431
  • 8
  • 61
  • 93
2

I am providing one more addition to this wonderful thread of nice answers! Use the tidyr package (included in tidyverse) for a one-liner solution:

df %>% tidyr::uncount(weights = freq)
a1a5a6
  • 41
  • 4
1

in fact. use the methods of vector and index. we can also achieve the same result, and more easier to understand:

rawdata <- data.frame('time' = 1:3, 
           'x1' = 4:6,
           'x2' = 7:9,
           'x3' = 10:12)

rawdata[rep(1, time=2), ] %>% remove_rownames()
#  time x1 x2 x3
# 1    1  4  7 10
# 2    1  4  7 10


yuanzz
  • 1,359
  • 12
  • 15