Speed up unpacking aggregated data

Question

I have received aggregated data of 2.7M lines with 13 columns of aggregated data. I am trying to unpack it quickly. A shortened example of potential data and code is below:

columns types = sex, age, hair_color, eye_color, birth_state, survey_date, count

new_data <- tibble(sex = c("M", "M", "F", "F", "F"), 
         age = c(18, 27, 34, 21, 25),
         hair_color = c("Brown", "Blonde","Black", "Pink", "Blonde"),
         eye_color = c("Brown", "Blue","Black", "Green", "Green"),
         birth_state = c("AK", "CO","CO", "FL", "CA"),
         survey_date = as.POSIXct(c("1/1/2020", "5/1/2020","2/2/2020", "1/10/2020", "1/1/2020"),format = "%d/%m/%Y"),
         count = c(10,14,6,8,6))

The "10" in the last category of the first row, "count", demonstrates that there were 10 individuals counted which matched the preceding column information. I am trying to unpack the data set to have 10 lines all with the same preceding information rather than a single line with a "10" at the end.

Below is my current code:

i <- as.numeric(nrow(new_data))
check <- new_data

for (i in 1:i){
  k <- new_data[i,]
  j <- new_data[i,7]-1
  u <- data.frame(t(replicate(j, k, simplify = TRUE)))
  l <- list(check, u)
  check <- do.call("rbind", l)
  print(i)
}

check$cnts <- 1
end <- Sys.time()
start-end

Code thoughts:
i - find out how many total lines i will have to duplicate
k - pulling the line that I am going to duplicate
j - identifying how many times I need to repeat the line
u - establish a data.frame which uses replicate to generate multiple lines. I have to transpose it to get it into the right frame work.
l - create two lists to be joined.
check - to call "rbind()" to link the list together. This is providing me the best data.frame I have found yet.

check$cnts <- 1 just changes the count from whatever it was to 1. This indicates that it is a single survey point.

I have tried a handful of different methods and can't seem to find anything that runs quickly. I have been trying to unpack this for a couple days and it is still running. When I have tried other packages (rbind(), repeat(), etc) I end up with a data.frame which has lists embedded in it. Researching that problem, I seem to find that rbind() has problems at times.

I read another post about rbind() and it looks like I can't use multiple processors to speed things up. Any help would be greatly appreciated!

[Repeat each row of data.frame the number of times specified in a column](https://stackoverflow.com/questions/2894775/repeat-each-row-of-data-frame-the-number-of-times-specified-in-a-column) — Henrik, Jun 30 '20 at 18:04

Speed up unpacking aggregated data

0 Answers0