duplicate rows in a data frame in R

Question

I am trying to duplicate rows in my data frame using the code below. However, I'm finding it to be slow.

duprow = df[1,]
for(i in 1:2000)
{
    print(i)
    df = rbind(df,duprow)
}

Is there a faster way?

score 19 · Answer 1 · answered Apr 20 '15 at 09:17

You can use rep, e.g. for 5 duplicates of row 1:

df <- data.frame(x = 1, y = 1)
rbind(df, df[rep(1, 5), ])
#     x y
# 1   1 1
# 11  1 1
# 1.1 1 1
# 1.2 1 1
# 1.3 1 1
# 1.4 1 1
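
Applied to the question's case (row 1 repeated 2000 times), the same idea might look like the sketch below. The toy df and the name n_copies are assumptions for illustration, and resetting the row names is optional.

```r
# Toy data frame standing in for the asker's df (an assumption)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

n_copies <- 2000  # how many extra copies of row 1 to append

# Build all the copies in one indexing step, then bind once
result <- rbind(df, df[rep(1, n_copies), ])

# Optional: replace the generated row names (1.1, 1.2, ...) with 1..n
rownames(result) <- NULL

nrow(result)  # 2003
```

This replaces 2000 calls to rbind with a single one, which is where the loop in the question spends its time.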

lukeA

That's a clever trick to just rep the row number of the row you want to rep as an index to the df. – DashdotdotDashdotdot Feb 03 '23 at 01:19

score 15 · Answer 2 · answered Aug 21 '18 at 18:49

Here's my crack at it:

> # create an example data frame
> colornames=c("violet","indigo","blue","green","yellow","orange","red")
> wavelength=c(400,425,470,550,600,630,665)
> df <- data.frame(colornames, wavelength)
> 
> # How many replicates you want of each row
> duptimes <- c(0,1,2,1,1,4,1)
> 
> # Create an index of the rows you want with duplications
> idx <- rep(1:nrow(df), duptimes)
> 
> # Use that index to generate your new data frame
> dupdf <- df[idx,]
> 
> # display results
> df
  colornames wavelength
1     violet        400
2     indigo        425
3       blue        470
4      green        550
5     yellow        600
6     orange        630
7        red        665
> dupdf
    colornames wavelength
2       indigo        425
3         blue        470
3.1       blue        470
4        green        550
5       yellow        600
6       orange        630
6.1     orange        630
6.2     orange        630
6.3     orange        630
7          red        665

I don't know if this is any faster, but it doesn't require loading additional packages, and it also removes unwanted rows (a count of 0 drops the row).

The downside is that you need to decide on a count for each row in the data frame, but that shouldn't be too difficult to code.
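
If every row should get the same number of copies, the per-row vector collapses to a single number. A minimal sketch, with a made-up df and n, showing how rep's times and each arguments order the result differently:

```r
# Made-up example data (an assumption, not the data frame from the answer)
df <- data.frame(id = c("a", "b", "c"), value = c(10, 20, 30),
                 stringsAsFactors = FALSE)

n <- 2  # copies of every row (hypothetical name)

# times = n repeats the whole sequence of indices: rows a, b, c, a, b, c
stacked <- df[rep(seq_len(nrow(df)), times = n), ]

# each = n repeats every index in place: rows a, a, b, b, c, c
grouped <- df[rep(seq_len(nrow(df)), each = n), ]

stacked$id  # "a" "b" "c" "a" "b" "c"
grouped$id  # "a" "a" "b" "b" "c" "c"
```

Which ordering is preferable depends on whether the copies should stay next to their source row.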

This worked well for me and was fast using a dataframe that started with 1 million rows. If you want to make the same number of repetitions for each row you can use `reptimes <- 12; idx <- rep(1:nrow(df), reptimes); rep_df <- df[idx, ]` — mikey, Apr 29 '19 at 14:48
This trick is ingenious. It also works for vectors and you can apply it multiple times to nest data. — piegames, Dec 18 '20 at 17:30

Adriaan Nering Bögel · Answer 3 · 2020-11-19T14:45:09.337

I had a similar problem, which I wanted to solve in a tidy way using dplyr. I ended up filtering the designated rows from my data frame by row number with dplyr::filter() and dplyr::row_number(), then binding them to the original data frame with dplyr::bind_rows(), all in one pipe. In your example it would be something like this:

library(dplyr)

# Keep the first 2000 rows, then append the full df, so those rows appear twice
df %>% 
  filter(row_number() <= 2000) %>% 
  bind_rows(df)

Fast and easy if you want to duplicate specific rows! Of course, you can pick specific row numbers to duplicate with filter(row_number() %in% c(...)).
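
For the question's exact case (one row repeated many times), a dplyr version could combine slice() with rep(). This is a sketch rather than the method used in this answer, and the toy df is an assumption.

```r
library(dplyr)

# Toy data frame standing in for the asker's df (an assumption)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# slice() takes row positions; repeating a position repeats the row
copies <- df %>% slice(rep(1, 2000))

# Append the copies after the original rows
result <- bind_rows(df, copies)

nrow(result)  # 2003
```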

score 3 · Answer 4 · edited May 23 '17 at 12:16

Luke's answer using rep() does your job for now, but these answers below might be able to help you in the longer run.

Please take a look at this answer on speeding up rbind, which explains why it is slow and why to avoid loops. It also has code to preallocate your data frame. Also see joran's "Second circle of hell" comment.
Suggestion: rbind.fill, from @coanil

Two things I'd like to add: 1) Generally, if you don't want to use data.table, you can use the rbind.fill function in Hadley's plyr package, which is quite fast, too. Never use rbind the way you did above, in a 'for' loop, appending each row separately. It forces R to make a copy of the data frame object every time you append one row, and that is slow.

https://stackoverflow.com/a/19699342/4606130

If you go the data.table route, then use rbindlist, which is faster. (@David suggests this in the first answer link.)
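
To illustrate the point about loops: when rows really do have to be produced one at a time, one common base R pattern is to collect them in a preallocated list and bind once at the end, instead of calling rbind on every iteration. A sketch, with made-up data and hypothetical names (pieces, n):

```r
n <- 1000

# Preallocate a list with one slot per row to be produced
pieces <- vector("list", n)

for (i in seq_len(n)) {
  # Each iteration fills one slot; no growing data frame is copied here
  pieces[[i]] <- data.frame(id = i, value = i^2)
}

# A single rbind call over all pieces
result <- do.call(rbind, pieces)

dim(result)  # 1000 2
```

With data.table loaded, rbindlist(pieces) would take the place of the do.call line.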

score 3 · Answer 5 · answered Apr 20 '15 at 10:11

I had a similar issue yesterday and there is also this package called 'splitstackshape'. Then it is as simple as the following code:

library(splitstackshape)
df <- data.frame(x = 1, y = 1)
df2 <- expandRows(df, count=2000, count.is.col=FALSE)

You might also want to 'fix' the rownames by doing

rownames(df2) <- 1:2000

Maarten

I usually just use `rownames(df2) <- NULL` to achieve the same effect. Or, if the input is a `data.table`, there won't be row names to begin with. Eg: `expandRows(as.data.table(df), count = 2000, count.is.col = FALSE)` – A5C1D2H2I1M1N2O1R2T1 Apr 20 '15 at 10:40
