19

I am trying to duplicate rows in my data frame using the code below. However, I'm finding it to be slow.

duprow = df[1,]
for(i in 1:2000)
{
    print(i)
    df = rbind(df,duprow)
}

Is there a faster way?

josliber
tubby

5 Answers

19

You can use rep, e.g. for 5 duplicates of row 1:

df <- data.frame(x = 1, y = 1)
rbind(df, df[rep(1, 5), ])
#     x y
# 1   1 1
# 11  1 1
# 1.1 1 1
# 1.2 1 1
# 1.3 1 1
# 1.4 1 1
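Applied to the case in the question (2000 extra copies of row 1), the same idea could be sketched like this. The toy `df` and the name `n_dup` are placeholders, not the asker's real data:

```r
# Toy data frame standing in for the asker's data (assumption)
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

n_dup <- 2000  # number of extra copies of row 1 (illustrative name)

# Build all the copies in one indexing step, then bind once
out <- rbind(df, df[rep(1, n_dup), ])

# Optional: drop the generated row names such as "1.1", "1.2", ...
rownames(out) <- NULL

nrow(out)
# 2003
```

Because the copies are created by a single subset and a single rbind, the data frame is not re-copied once per appended row as it is in the loop.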
lukeA
15

Here's my crack at it:

> # create an example data frame
> colornames=c("violet","indigo","blue","green","yellow","orange","red")
> wavelength=c(400,425,470,550,600,630,665)
> df <- data.frame(colornames, wavelength)
> 
> # How many replicates you want of each row
> duptimes <- c(0,1,2,1,1,4,1)
> 
> # Create an index of the rows you want with duplications
> idx <- rep(1:nrow(df), duptimes)
> 
> # Use that index to generate your new data frame
> dupdf <- df[idx,]
> 
> # display results
> df
  colornames wavelength
1     violet        400
2     indigo        425
3       blue        470
4      green        550
5     yellow        600
6     orange        630
7        red        665
> dupdf
    colornames wavelength
2       indigo        425
3         blue        470
3.1       blue        470
4        green        550
5       yellow        600
6       orange        630
6.1     orange        630
6.2     orange        630
6.3     orange        630
7          red        665

I haven't timed this, so I don't know if it is any faster, but it doesn't require loading additional packages, and it also removes unwanted rows (any row with a count of 0, such as violet above).

The downside is that you need to choose a count for every row in the data frame, but that shouldn't be too difficult to code.
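To check the speed on your own machine, a rough timing sketch along these lines may help. The helper names `grow_in_loop` and `dup_by_index` and the small `df` are made up for illustration, and the timings will vary with data size and hardware:

```r
# Small example data frame (assumption: any data frame works here)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
n_dup <- 2000

# Approach from the question: grow the data frame one row at a time
grow_in_loop <- function(df, n) {
  duprow <- df[1, ]
  for (i in seq_len(n)) {
    df <- rbind(df, duprow)
  }
  df
}

# Index-based approach: build the row index once, subset once
dup_by_index <- function(df, n) {
  idx <- c(seq_len(nrow(df)), rep(1, n))
  df[idx, ]
}

t_loop  <- system.time(res_loop  <- grow_in_loop(df, n_dup))
t_index <- system.time(res_index <- dup_by_index(df, n_dup))

print(t_loop)
print(t_index)
```

Both functions should return the same rows in the same order; only the row names and the elapsed time differ.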

Andrew
    This worked well for me and was fast using a dataframe that started with 1 million rows. If you want to make the same number of repetitions for each row you can use `reptimes <- 12; idx <- rep(1:nrow(df), reptimes); rep_df <- df[idx, ]` – mikey Apr 29 '19 at 14:48
    This trick is ingenious. It also works for vectors and you can apply it multiple times to nest data. – piegames Dec 18 '20 at 17:30
5

I had a similar problem which I wanted to solve in a tidy way using dplyr. I ended up filtering the designated rows from my data frame by row number using dplyr::filter() and dplyr::row_number(), then binding them to the original data frame using dplyr::bind_rows(), all in one pipe. For a data frame like yours the pattern looks something like this (note that it adds one copy of each of the first 2000 rows, rather than 2000 copies of row 1):

library(dplyr)

# Take rows 1 to 2000, then append the full original data frame
df %>% 
  filter(row_number() <= 2000) %>% 
  bind_rows(df)

Fast and easy if you want to duplicate specific rows! Of course, you can pick specific row numbers to duplicate using filter(row_number() %in% c(...)).

3

Luke's answer using rep() does your job for now, but these answers below might be able to help you in the longer run.

  1. Please take a look at this answer on speeding up rbind, which explains why it is slow and why you should not call it in a loop. It also has code to preallocate your data frame. Also see joran's "second circle of hell" comment.

  2. Suggestion: rbind.fill, from @coanil

    Two things I'd like to add: 1) Generally, if you don't want to use data.table, you can use the rbind.fill function in Hadley's plyr package, which is quite fast, too. Never use rbind the way you did above, in a 'for' loop, appending each row separately. It forces R to make a copy of the data frame object every time you append one row, and that is slow.

https://stackoverflow.com/a/19699342/4606130

  3. If you go the data.table route, then use rbindlist, which is faster. (@David suggests this in the first answer link.)
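As a base R illustration of the "don't grow inside the loop" advice, one common pattern is to collect the pieces in a preallocated list and bind them once at the end. This is only a sketch; `make_piece` is a hypothetical stand-in for whatever produces each row or chunk:

```r
# Hypothetical generator for the pieces to be combined
make_piece <- function(i) data.frame(id = i, value = i^2)

n <- 1000

# Preallocate a list, fill it in the loop, and bind a single time at the end
pieces <- vector("list", n)
for (i in seq_len(n)) {
  pieces[[i]] <- make_piece(i)
}
combined <- do.call(rbind, pieces)

dim(combined)
# 1000    2
```

If data.table is installed, `data.table::rbindlist(pieces)` can take the place of the `do.call(rbind, pieces)` step.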
micstr
3

I had a similar issue yesterday. There is also a package called 'splitstackshape', which makes it as simple as the following code:

library(splitstackshape)
df <- data.frame(x = 1, y = 1)
df2 <- expandRows(df, count=2000, count.is.col=FALSE)

You might also want to 'fix' the rownames by doing

rownames(df2) <- 1:2000
Maarten
    I usually just use `rownames(df2) <- NULL` to achieve the same effect. Or, if the input is a `data.table`, there won't be row names to begin with. E.g.: `expandRows(as.data.table(df), count = 2000, count.is.col = FALSE)` – A5C1D2H2I1M1N2O1R2T1 Apr 20 '15 at 10:40