How can I simplify this code (r) in which I am using information from an original data set to create a new dataset?

Question

I have a data set that I am trying to use to generate a different data set in R. The data set has many columns, but the three relevant columns for generating the new data set are "Reach", "Results", and "DV". Reach and Results are numeric. DV is binary (0 or 1). In the original data set, all rows have DV = 0.

For each row of the original data set, I want to replicate that row "Reach" times. Then, within this new set of rows, I want to change DV from 0 to 1 for "Results" of the new rows (using the Results value from the original row).

For example, in row 33 of the original data set: Reach = 1004, Results = 45, DV = 0. The new data set should have row 33 replicated 1004 times, and for 45 of those new rows DV should be changed from 0 to 1.

The code I wrote for the task works, but it is taking 10+ hours to run because the file is so large. Any ideas for how to simplify this code so it runs more quickly?

empty_new.video <- new.video[FALSE,]
for(i in 1:nrow(new.video)){
  n.times <- new.video[i,'Reach'] #determine number of times to repeat rows
  if (n.times > 0){
    for (j in 1:n.times){
      empty_new.video[nrow(empty_new.video) + 1 , ] <- new.video[i,]
    }
  }
  dv.times <- new.video[i,'Results'] #creating dependent variable 
  if (dv.times>0){
    for (k in 1:dv.times){
      empty_new.video[nrow(empty_new.video) - n.times + k,'DV'] <- 1
    }
  }
}

Please see https://stackoverflow.com/questions/19697700/how-to-speed-up-rbind. `rep` function may help as well: https://stackoverflow.com/questions/14693956/how-can-i-prevent-rbind-from-geting-really-slow-as-dataframe-grows-larger/14694108#14694108 — Grzegorz Sapijaszko, Jan 25 '22 at 16:52
@GrzegorzSapijaszko op doesn't even use rbind in the example, how could speeding up rbind help? — rawr, Jan 25 '22 at 17:04
I meant to create a subset of required rows with rep, and then rbind it to final df. — Grzegorz Sapijaszko, Jan 26 '22 at 08:49

score 0 · Answer 1 · answered Jan 25 '22 at 17:02

Avoid growing objects in a loop. Consider Map (a wrapper to mapply) to iterate elementwise through the relevant columns of the original data set, building a list of data frames that you concatenate once at the end.

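build_rows <- function(reach, results) {
    # DATA FRAME WITH ONE ROW PER REPLICATE (ASSUMES REACH >= 1)
    df <- data.frame(id = reach, reach = seq_len(reach), dv = 0)

    # CHOOSE ONE OF THE TWO OPTIONS BELOW (dv IS A VECTOR, SO NO COMMA IN THE INDEX)

    # OPTION 1: ASSIGN FIRST N ROWS TO 1 (N=RESULTS), AS IN THE QUESTION'S LOOP
    df$dv[seq_len(results)] <- 1

    # OPTION 2: RANDOMLY ASSIGN N ROWS TO 1 (N=RESULTS)
    # df$dv[sample.int(nrow(df), results)] <- 1

    return(df)
}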

df_list <- Map(build_rows, original_data$Reach, original_data$Results)

final_df <- do.call(rbind, df_list)
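
As a quick check of the list-then-bind pattern above, here is a minimal sketch on a two-row toy data set. The names `toy` and `expand_row` are made up for illustration, and the sketch assumes Reach is at least 1 and Results never exceeds Reach. It sets the first Results rows to 1, like the loop in the question, and relies on `seq_len` so that a Results of 0 leaves every DV at 0.

```r
# Toy input (hypothetical values, not taken from the question's data)
toy <- data.frame(Reach = c(5, 3), Results = c(4, 0))

# Build the expanded rows for one original row.
# Assumes reach >= 1 and 0 <= results <= reach.
expand_row <- function(reach, results) {
  out <- data.frame(Reach = rep(reach, reach),
                    Results = rep(results, reach),
                    DV = 0)
  # seq_len(0) is empty, so nothing is changed when results is 0
  out$DV[seq_len(results)] <- 1
  out
}

# One data frame per original row, bound together once at the end
pieces <- Map(expand_row, toy$Reach, toy$Results)
expanded <- do.call(rbind, pieces)

nrow(expanded)    # 8 rows: 5 + 3
sum(expanded$DV)  # 4 ones, all from the first original row
```

Binding once at the end is the point of the pattern: each piece is allocated a single time, rather than the result being copied every time a row is appended.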

score 0 · Accepted Answer · answered Jan 25 '22 at 17:03

Rather than writing a loop that does everything at once, you could define a simple function that handles one row and check its result:

dd <- data.frame(Reach = c(5, 3), Results = c(4, 1), DV = c(0, 0))
#   Reach Results DV
# 1     5       4  0
# 2     3       1  0

f <- function(data) {
  nr <- data$Reach
  nd <- data$Results
  data <- data[rep_len(1L, nr), ]
  data$DV <- rep(0:1, c(nr - nd, nd))
  rownames(data) <- NULL
  data
}
f(dd[1, ])

Then loop for every row

# split on the row index (rownames(dd) is character, so "10" would sort before "2")
res <- lapply(split(dd, seq_len(nrow(dd))), f)
do.call('rbind', res)
#     Reach Results DV
# 1.1     5       4  0
# 1.2     5       4  1
# 1.3     5       4  1
# 1.4     5       4  1
# 1.5     5       4  1
# 2.1     3       1  0
# 2.2     3       1  0
# 2.3     3       1  1

But really, all you are doing is creating a vector of row indices and a vector of 0/1 values for DV. You can do both with rep:

ii <- rep(1:nrow(dd), dd$Reach)

jj <- c(t(cbind(dd$Reach - dd$Results, dd$Results)))
dv <- rep(rep(0:1, nrow(dd)), jj)

within(dd[ii, ], {
  DV <- dv
})
#     Reach Results DV
# 1       5       4  0
# 1.1     5       4  1
# 1.2     5       4  1
# 1.3     5       4  1
# 1.4     5       4  1
# 2       3       1  0
# 2.1     3       1  0
# 2.2     3       1  1
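
For reuse on a larger data set, the rep approach above could be wrapped in a small function that checks its inputs first. This is only a sketch: the name `expand_rows` is hypothetical, and the checks assume Reach and Results are whole, non-missing numbers with 0 <= Results <= Reach. It places the 1s last within each block, matching the output above.

```r
# Hypothetical wrapper around the rep() approach; not part of any package
expand_rows <- function(data) {
  reach <- data$Reach
  results <- data$Results

  # Fail early with a readable message instead of an error from inside rep()
  stopifnot(
    !anyNA(reach), !anyNA(results),
    reach >= 0, results >= 0, results <= reach,
    reach == round(reach), results == round(results)
  )

  # Row i of the input is repeated reach[i] times
  ii <- rep(seq_len(nrow(data)), reach)

  # For each input row: (reach - results) zeros, then results ones
  counts <- c(rbind(reach - results, results))
  dv <- rep(rep(0:1, nrow(data)), counts)

  out <- data[ii, , drop = FALSE]
  out$DV <- dv
  rownames(out) <- NULL
  out
}

dd <- data.frame(Reach = c(5, 3), Results = c(4, 1), DV = c(0, 0))
expanded <- expand_rows(dd)
expanded$DV  # 0 1 1 1 1 0 0 1
```

Because everything is built from a handful of `rep` calls and one subsetting step, no R-level loop runs per output row, which is where the original code spends its time.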

Thanks! I am getting a message that using nr in rep_len(1L, nr) is an invalid 'length.out' value. Any idea what's happening there? I've confirmed nr is an integer — Erin Morrissey, Jan 26 '22 at 15:39
is reach ever negative? is reach always greater than results? — rawr, Jan 26 '22 at 15:41
reach is always greater than zero and always greater than results — Erin Morrissey, Jan 26 '22 at 16:00
