Create duplicate rows based on conditions in R

Question

I have a data.table that looks like this

dt <- data.table(ID=c("A","A","B","B"),Amount1=c(100,200,300,400),
                 Amount2=c(1500,1500,2400,2400),Dupl=c(1,0,1,0))

   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A     200    1500    0
3:  B     300    2400    1
4:  B     400    2400    0

I need to duplicate each row that has a 1 in the Dupl column and replace the Amount1 value with the Amount2 value in that duplicated row. Besides that I need to give that duplicated row the value 2 in Dupl. This means it should look like this:

   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A    1500    1500    2
3:  A     200    1500    0
4:  B     300    2400    1
5:  B    2400    2400    2
6:  B     400    2400    0

Any help is much appreciated! Kind regards,

Tim

Are there any other rules we need to know? Could there be four rows for a given ID? Does the "Dupl" column have values other than "1" and "2"? — A5C1D2H2I1M1N2O1R2T1, Mar 10 '15 at 10:50

zx8754 · Answer 1 · 2021-01-05T21:32:59.827

12

Using dplyr

library("data.table")
library("dplyr")

#data
dt <- data.table(ID = c("A", "A", "B", "B"),
                 Amount1 = c(100, 200, 300, 400),
                 Amount2 = c(1500, 1500, 2400, 2400),
                 Dupl = c(1, 0, 1, 0))
#result
rbind(dt,
      dt %>% 
        filter(Dupl == 1) %>% 
        mutate(Dupl = 2,
               Amount1 = Amount2))

#    ID Amount1 Amount2 Dupl
# 1:  A     100    1500    1
# 2:  A     200    1500    0
# 3:  B     300    2400    1
# 4:  B     400    2400    0
# 5:  A    1500    1500    2
# 6:  B    2400    2400    2

edited Jan 05 '21 at 21:32

answered Mar 10 '15 at 10:57

zx8754

52,746
12
114
209

1

I always try to avoid plyr since I work with large datasets, but this worked as well. Many thanks. – Tim_Utrecht Mar 10 '15 at 11:40

score 11 · Accepted Answer · answered Mar 10 '15 at 10:58

11

You could try

rbind(dt,dt[Dupl==1][,c('Amount1', 'Dupl') := list(Amount2, 2)])

answered Mar 10 '15 at 10:58

akrun

874,273
37
540
662

@DavidArenburg Thanks, I was also thinking in terms of `rleid` using your template code, but found to be a bit difficult. – akrun Mar 10 '15 at 11:04
How do you generalize this? For example my data.table has a column defining groups. I want to duplicate each whole group N times, where N is calculated as a function of some column. – skan Apr 22 '21 at 01:14
@skan can you post as a question so that it becomes more easier to understand – akrun Apr 22 '21 at 16:14

James · Answer 3 · 2015-03-10T11:20:45.250

You can rbind a copy of the sub-setted data with the correct transformations done:

rbind(dt,copy(dt[Dupl==1])[,Amount1:=Amount2][,Dupl:=Dupl+1])
   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A     200    1500    0
3:  B     300    2400    1
4:  B     400    2400    0
5:  A    1500    1500    2
6:  B    2400    2400    2

Alternatively, you can get the duplicates by sub-setting, and then transform the duplicated rows using an intermediate step. This keeps the duplicated row next to the original as in the example in the question:

x <- dt[rep(seq(dt[,Dupl]),times=dt[,Dupl==1]+1)]
x[duplicated(x),c("Amount1","Dupl"):=list(Amount2,Dupl+1)]
x
   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A    1500    1500    2
3:  A     200    1500    0
4:  B     300    2400    1
5:  B    2400    2400    2
6:  B     400    2400    0

You can do the first step in the second solution just by `x <- dt[rep(seq_len(.N), Dupl + 1L)]`. It seems like calling `dt` 3 times in this single step is a big overhead if `dt` is big. — David Arenburg, Mar 10 '15 at 11:51

score 3 · Answer 4 · answered Mar 10 '15 at 10:48

This seems to do what you are asking for. Can probably be refined a bit...

library(splitstackshape)
expandRows(dt, dt$Dupl+1, count.is.col = FALSE)[
  Dupl != 0, Dupl := cumsum(Dupl), by = ID][
    , Amount1 := ifelse(Dupl > 1, Amount2[-1], Amount1)][]
#    ID Amount1 Amount2 Dupl
# 1:  A     100    1500    1
# 2:  A    1500    1500    2
# 3:  A     200    1500    0
# 4:  B     300    2400    1
# 5:  B    2400    2400    2
# 6:  B     400    2400    0

score 2 · Answer 5 · edited Nov 09 '18 at 00:11

2

Using dplyr's left_join to do the duplication work. Perhaps not elegant, but should be easy to understand.

library(data.table)
library(dplyr)

joiner <- data.frame(Dupl = 1, helper_col= 1:2)

dt <- left_join(dt, joiner) %>%
  mutate(Dupl = ifelse(helper_col == 2 & !is.na(helper_col), 2, Dupl)) %>%
  select(-helper_col) %>%
  mutate(Amount1 = ifelse(Dupl == 2, Amount2, Amount1))

> dt
  ID Amount1 Amount2 Dupl
1  A     100    1500    1
2  A    1500    1500    2
3  A     200    1500    0
4  B     300    2400    1
5  B    2400    2400    2
6  B     400    2400    0

edited Nov 09 '18 at 00:11

Stephen Rauch

47,830
31
106
135

answered Nov 08 '18 at 23:52

Goldenmean1618

43
1
5

This solution is simple and understandable. It is also more flexible imho than the others because one can easily adapt the joiner data frame to handle different scenarios where you might want more than one duplication, e.g. expanding quarters into months. Viewing years later, I found this to be the most useful. – Zach S. Aug 08 '23 at 18:32

score 1 · Answer 6 · answered May 23 '20 at 23:37

Biased here, but I think this dplyr solution is elegant, and it's also pretty scalable, especially as long as Dupl is always <= 2. Essentially, it takes advantage of tidyr::uncount, which says, 'Based on a given column's value (x), repeat each row x times, thereby elongating the df.' Once we've lengthened the df, we can just use dplyr::mutate_at to replace cells if they're the same value as their lag.

library(tidyverse)
dt %>%
    uncount(Dupl + 1) %>%
    mutate_at(vars(Amount1),
              ~case_when(. == lag(.) ~ Amount2, TRUE ~.)) %>%
    mutate_at(vars(Dupl),
              ~case_when(. == lag(.) ~ 2, TRUE ~.))

#    ID Amount1 Amount2 Dupl
# 1:  A     100    1500    1
# 2:  A    1500    1500    2
# 3:  A     200    1500    0
# 4:  B     300    2400    1
# 5:  B    2400    2400    2
# 6:  B     400    2400    0

Create duplicate rows based on conditions in R

6 Answers6

Linked