I have received aggregated data of 2.7M lines with 13 columns of aggregated data. I am trying to unpack it quickly. A shortened example of potential data and code is below:
columns types = sex, age, hair_color, eye_color, birth_state, survey_date, count
new_data <- tibble(sex = c("M", "M", "F", "F", "F"),
age = c(18, 27, 34, 21, 25),
hair_color = c("Brown", "Blonde","Black", "Pink", "Blonde"),
eye_color = c("Brown", "Blue","Black", "Green", "Green"),
birth_state = c("AK", "CO","CO", "FL", "CA"),
survey_date = as.POSIXct(c("1/1/2020", "5/1/2020","2/2/2020", "1/10/2020", "1/1/2020"),format = "%d/%m/%Y"),
count = c(10,14,6,8,6))
The "10" in the last category of the first row, "count", demonstrates that there were 10 individuals counted which matched the preceding column information. I am trying to unpack the data set to have 10 lines all with the same preceding information rather than a single line with a "10" at the end.
Below is my current code:
i <- as.numeric(nrow(new_data))
check <- new_data
for (i in 1:i){
k <- new_data[i,]
j <- new_data[i,7]-1
u <- data.frame(t(replicate(j, k, simplify = TRUE)))
l <- list(check, u)
check <- do.call("rbind", l)
print(i)
}
check$cnts <- 1
end <- Sys.time()
start-end
Code thoughts:
i - find out how many total lines i will have to duplicate
k - pulling the line that I am going to duplicate
j - identifying how many times I need to repeat the line
u - establish a data.frame which uses replicate to generate multiple lines. I have to transpose it to get it into the right frame work.
l - create two lists to be joined.
check - to call "rbind()" to link the list together. This is providing me the best data.frame I have found yet.
check$cnts <- 1 just changes the count from whatever it was to 1. This indicates that it is a single survey point.
I have tried a handful of different methods and can't seem to find anything that runs quickly. I have been trying to unpack this for a couple days and it is still running. When I have tried other packages (rbind(), repeat(), etc) I end up with a data.frame which has lists embedded in it. Researching that problem, I seem to find that rbind() has problems at times.
I read another post about rbind() and it looks like I can't use multiple processors to speed things up. Any help would be greatly appreciated!