I have relied heavily on posts 1 and 2 to come up with the following closure code. The code works fine on a compressed XML file of 1.3 GB (13.5 GB uncompressed), but it takes roughly 10 hours to produce the final result. I have timed the code, and the closure function accounts for approximately 9.5 of those 10 hours, so I am only posting the relevant closure portion. Given this, is there any way to speed the code up further? Can parallelization come to my rescue here? Here is a very small data sample.
UPDATE: Links to the 25% sample data and the 100% population.
library(XML)
branchFunction <- function() {
  # Environment used as a hash table, keyed by person id
  store <- new.env()
  func <- function(x, ...) {
    # From each branch, grab the <person> node itself plus every descendant
    # of the selected plan except <route> elements
    ns <- getNodeSet(x, path = "//person[@id]|//plan[@selected='yes']//*[not(self::route)]")
    value <- lapply(ns, xmlAttrs)
    # The <person> node comes first in document order; key the store on its
    # id attribute (more robust than value[[1]] if <person> ever carries
    # attributes other than id)
    id <- value[[1]][["id"]]
    store[[id]] <- value
  }
  getStore <- function() { as.list(store) }
  list(person = func, getStore = getStore)
}
myfunctions <- branchFunction()
# Stream-parse the file; each <person> subtree is handed to the branch
# handler without building the whole document tree in memory
xmlEventParse(file = "plansfull.xml", handlers = NULL, branches = myfunctions)

# To see what is inside the store
l <- myfunctions$getStore()
l <- unlist(l, recursive = FALSE)
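
On the parallelization question: xmlEventParse streams a single file sequentially, so the usual route is to pre-split the input into several smaller, individually well-formed XML files and parse them concurrently, merging the per-chunk stores afterwards. Below is a minimal sketch of that idea, not a method from the posts above; the chunk file names and core count are hypothetical, and it assumes the 13.5 GB file has already been split so that each chunk keeps whole <person> subtrees under a wrapper root element. Note that mclapply relies on forking and is not available on Windows (parLapply with a cluster would be the substitute there).

library(parallel)

# Parse one chunk with its own closure store and return that store
parseChunk <- function(chunkFile) {
  funcs <- branchFunction()
  xmlEventParse(file = chunkFile, handlers = NULL, branches = funcs)
  funcs$getStore()
}

# Hypothetical names for the pre-split chunk files
chunkFiles <- sprintf("plans_chunk_%02d.xml", 1:8)

# One forked worker per chunk, up to mc.cores at a time
stores <- mclapply(chunkFiles, parseChunk, mc.cores = 4)

# Concatenate the per-chunk stores into one list keyed by person id
l <- do.call(c, stores)
l <- unlist(l, recursive = FALSE)

Whether this helps depends on where the 9.5 hours actually go: if XPath evaluation inside the handler dominates, parsing chunks in parallel should scale roughly with the core count; if disk I/O dominates, the gain will be smaller.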