15

I have a data.table that looks like this

dt <- data.table(ID=c("A","A","B","B"),Amount1=c(100,200,300,400),
                 Amount2=c(1500,1500,2400,2400),Dupl=c(1,0,1,0))

   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A     200    1500    0
3:  B     300    2400    1
4:  B     400    2400    0

I need to duplicate each row that has a 1 in the Dupl column and replace the Amount1 value with the Amount2 value in that duplicated row. Besides that I need to give that duplicated row the value 2 in Dupl. This means it should look like this:

   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A    1500    1500    2
3:  A     200    1500    0
4:  B     300    2400    1
5:  B    2400    2400    2
6:  B     400    2400    0

Any help is much appreciated! Kind regards,

Tim

Tim_Utrecht
  • 1,459
  • 6
  • 24
  • 44

6 Answers6

12

Using dplyr

library("data.table")
library("dplyr")

#data
dt <- data.table(ID = c("A", "A", "B", "B"),
                 Amount1 = c(100, 200, 300, 400),
                 Amount2 = c(1500, 1500, 2400, 2400),
                 Dupl = c(1, 0, 1, 0))
#result
rbind(dt,
      dt %>% 
        filter(Dupl == 1) %>% 
        mutate(Dupl = 2,
               Amount1 = Amount2))

#    ID Amount1 Amount2 Dupl
# 1:  A     100    1500    1
# 2:  A     200    1500    0
# 3:  B     300    2400    1
# 4:  B     400    2400    0
# 5:  A    1500    1500    2
# 6:  B    2400    2400    2
zx8754
  • 52,746
  • 12
  • 114
  • 209
11

You could try

rbind(dt,dt[Dupl==1][,c('Amount1', 'Dupl') := list(Amount2, 2)])
akrun
  • 874,273
  • 37
  • 540
  • 662
  • @DavidArenburg Thanks, I was also thinking in terms of `rleid` using your template code, but found to be a bit difficult. – akrun Mar 10 '15 at 11:04
  • How do you generalize this? For example my data.table has a column defining groups. I want to duplicate each whole group N times, where N is calculated as a function of some column. – skan Apr 22 '21 at 01:14
  • @skan can you post as a question so that it becomes more easier to understand – akrun Apr 22 '21 at 16:14
4

You can rbind a copy of the sub-setted data with the correct transformations done:

rbind(dt,copy(dt[Dupl==1])[,Amount1:=Amount2][,Dupl:=Dupl+1])
   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A     200    1500    0
3:  B     300    2400    1
4:  B     400    2400    0
5:  A    1500    1500    2
6:  B    2400    2400    2

Alternatively, you can get the duplicates by sub-setting, and then transform the duplicated rows using an intermediate step. This keeps the duplicated row next to the original as in the example in the question:

x <- dt[rep(seq(dt[,Dupl]),times=dt[,Dupl==1]+1)]
x[duplicated(x),c("Amount1","Dupl"):=list(Amount2,Dupl+1)]
x
   ID Amount1 Amount2 Dupl
1:  A     100    1500    1
2:  A    1500    1500    2
3:  A     200    1500    0
4:  B     300    2400    1
5:  B    2400    2400    2
6:  B     400    2400    0
James
  • 65,548
  • 14
  • 155
  • 193
  • 1
    You can do the first step in the second solution just by `x <- dt[rep(seq_len(.N), Dupl + 1L)]`. It seems like calling `dt` 3 times in this single step is a big overhead if `dt` is big. – David Arenburg Mar 10 '15 at 11:51
3

This seems to do what you are asking for. Can probably be refined a bit...

library(splitstackshape)
expandRows(dt, dt$Dupl+1, count.is.col = FALSE)[
  Dupl != 0, Dupl := cumsum(Dupl), by = ID][
    , Amount1 := ifelse(Dupl > 1, Amount2[-1], Amount1)][]
#    ID Amount1 Amount2 Dupl
# 1:  A     100    1500    1
# 2:  A    1500    1500    2
# 3:  A     200    1500    0
# 4:  B     300    2400    1
# 5:  B    2400    2400    2
# 6:  B     400    2400    0
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
2

Using dplyr's left_join to do the duplication work. Perhaps not elegant, but should be easy to understand.

library(data.table)
library(dplyr)

joiner <- data.frame(Dupl = 1, helper_col= 1:2)

dt <- left_join(dt, joiner) %>%
  mutate(Dupl = ifelse(helper_col == 2 & !is.na(helper_col), 2, Dupl)) %>%
  select(-helper_col) %>%
  mutate(Amount1 = ifelse(Dupl == 2, Amount2, Amount1))

> dt
  ID Amount1 Amount2 Dupl
1  A     100    1500    1
2  A    1500    1500    2
3  A     200    1500    0
4  B     300    2400    1
5  B    2400    2400    2
6  B     400    2400    0
Stephen Rauch
  • 47,830
  • 31
  • 106
  • 135
  • This solution is simple and understandable. It is also more flexible imho than the others because one can easily adapt the joiner data frame to handle different scenarios where you might want more than one duplication, e.g. expanding quarters into months. Viewing years later, I found this to be the most useful. – Zach S. Aug 08 '23 at 18:32
1

Biased here, but I think this dplyr solution is elegant, and it's also pretty scalable, especially as long as Dupl is always <= 2. Essentially, it takes advantage of tidyr::uncount, which says, 'Based on a given column's value (x), repeat each row x times, thereby elongating the df.' Once we've lengthened the df, we can just use dplyr::mutate_at to replace cells if they're the same value as their lag.

library(tidyverse)
dt %>%
    uncount(Dupl + 1) %>%
    mutate_at(vars(Amount1),
              ~case_when(. == lag(.) ~ Amount2, TRUE ~.)) %>%
    mutate_at(vars(Dupl),
              ~case_when(. == lag(.) ~ 2, TRUE ~.))

#    ID Amount1 Amount2 Dupl
# 1:  A     100    1500    1
# 2:  A    1500    1500    2
# 3:  A     200    1500    0
# 4:  B     300    2400    1
# 5:  B    2400    2400    2
# 6:  B     400    2400    0
jackbdolg
  • 105
  • 6