3

I would like to "expand" a dataframe, duplicating the information on some columns the number of times indicated by a fifth column.

What would the most efficiency to achieve this task with R? (Open to Data Table or Dplyer, reshape solutions).

Original Dataframe/DataTable:

   f_1 f_2        d_1        d_2 i_1
1:   1   A 2016-01-01       <NA>  NA
2:   2   A 2016-01-02       <NA>  NA
3:   2   B 2016-01-03 2016-01-01   2
4:   3   C 2016-01-04       <NA>  NA
5:   4   D 2016-01-05 2016-01-02   5

Desired Dataframe/DataTable

    f_1 f_2        d_1        d_2 i_1
 1:   1   A 2016-01-01       <NA>  NA
 2:   2   A 2016-01-02       <NA>  NA
 3:   2   B 2016-01-03 2016-01-01   1
 4:   2   B 2016-01-03 2016-01-01   2
 5:   3   C 2016-01-04       <NA>  NA
 6:   4   D 2016-01-05 2016-01-02   1
 7:   4   D 2016-01-05 2016-01-02   2
 8:   4   D 2016-01-05 2016-01-02   3
 9:   4   D 2016-01-05 2016-01-02   4
10:   4   D 2016-01-05 2016-01-02   5

Reproducible data:

DT <- data.table(
  f_1 = factor(c(1,2,2,3,4)),
  f_2 = factor(c("A", "A", "B", "C", "D")),
  d_1 = as.Date(c("2016-01-01","2016-01-02","2016-01-03","2016-01-04","2016-01-05")),
  d_2 = as.Date(c(NA,NA,"2016-01-01",NA,"2016-01-02")),
  i_1 = as.integer(c(NA,NA,2,NA,5)))

Thanks and sorry if it is duplicated. I am struggling with this kind of reshaping exercises.

s_baldur
  • 29,441
  • 4
  • 36
  • 69
alvaropr
  • 699
  • 9
  • 20

2 Answers2

3

Here is a data.table solution. Basically, group by those columns that you want to duplicate and generate sequence of integers using the number in i_1

DT[, .(i_1=if(!is.na(i_1)) seq_len(i_1) else i_1), 
    by=c(names(DT)[-ncol(DT)])]

output:

    f_1 f_2        d_1        d_2 i_1
 1:   1   A 2016-01-01       <NA>  NA
 2:   2   A 2016-01-02       <NA>  NA
 3:   2   B 2016-01-03 2016-01-01   1
 4:   2   B 2016-01-03 2016-01-01   2
 5:   3   C 2016-01-04       <NA>  NA
 6:   4   D 2016-01-05 2016-01-02   1
 7:   4   D 2016-01-05 2016-01-02   2
 8:   4   D 2016-01-05 2016-01-02   3
 9:   4   D 2016-01-05 2016-01-02   4
10:   4   D 2016-01-05 2016-01-02   5

Or another way using data.table. For each row, create a sequence of numbers using i_1 and add the original data to that sequence with c(.SD[, -"i_1], ..... and finally remove the by column

DT[, c(.SD[, -"i_1"], .(i_1=if (!is.na(i_1)) seq_len(i_1) else i_1)), 
    by=seq_len(DT[,.N])][,-1L]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
3

Are you OK replacing i_1 with 1 when it's NA? If so, the following would be slightly more readable:

First, repeat the rows the specified number of times (ad hoc accounting for the missing values of i_1, using replace courtesy of @Frank):

DT_out = DT[rep(1:.N, replace(i_1, is.na(i_1), 1L))]

This could be just DT[rep(1:.N, i_1)] if we've already replaced DT[is.na(i_1), i_1 := 1L].

All that's left is to update the values of i_1. There are simpler versions of this, depending on your data's particulars. Here I think is the more general version:

DT_out[!is.na(i_1), i_1 := rowidv(.SD), .SDcols = !'i_1'][]
#     f_1 f_2        d_1        d_2 i_1
#  1:   1   A 2016-01-01       <NA>  NA
#  2:   2   A 2016-01-02       <NA>  NA
#  3:   2   B 2016-01-03 2016-01-01   1
#  4:   2   B 2016-01-03 2016-01-01   2
#  5:   3   C 2016-01-04       <NA>  NA
#  6:   4   D 2016-01-05 2016-01-02   1
#  7:   4   D 2016-01-05 2016-01-02   2
#  8:   4   D 2016-01-05 2016-01-02   3
#  9:   4   D 2016-01-05 2016-01-02   4
# 10:   4   D 2016-01-05 2016-01-02   5

rowid and rowidv give the row number within the groups defined by the variables it's passed. You can compare with rowid(f_2), rowid(f_1), and rowid(f_1, f_2) to get an idea of what I mean. rowidv(.SD) is a shorthand for rowid(f_1, f_2, d_1, d_2), since we exclude i_1 from the columns in .SD.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198