
I have relied heavily on these posts 1 and 2 to come up with the following closure code. The code works fine on a compressed XML file of size 1.3 GB (13.5 GB uncompressed), but it takes roughly 10 hours to produce the final result. I have timed the code, and the closure function accounts for approximately 9.5 of those 10 hours (so I am only posting the relevant closure portion of the code). Given this, is there any way to speed this code up further? Can parallelization come to my rescue here? Here is a very small data sample.

UPDATE: Links to the [25% sample data](https://www.dropbox.com/s/9wrnz7mku6xlzdw/100.plans.xml.gz?dl=0) and the [100% population](https://www.dropbox.com/s/d8dtw10abbiuca6/300.plans.xml.gz?dl=0).

library(XML)

branchFunction <- function() {
  # Environment used as a mutable hash map: inserts happen in place,
  # so the store is not copied as it grows.
  store <- new.env()
  func <- function(x, ...) {
    # For each <person> branch, collect the person node (for its id)
    # plus every node under its selected plan, excluding <route> elements.
    ns <- getNodeSet(x, path = "//person[@id]|//plan[@selected='yes']//*[not(self::route)]")
    value <- lapply(ns, xmlAttrs)
    id <- value[[1]]   # the person's id attribute keys the store
    store[[id]] <- value
  }
  getStore <- function() { as.list(store) }
  list(person = func, getStore = getStore)
}

myfunctions <- branchFunction()

# SAX event parse: the "person" branch handler fires once per <person> node
xmlEventParse(file = "plansfull.xml", handlers = NULL, branches = myfunctions)

# to see what is inside
l <- myfunctions$getStore()
l <- unlist(l, recursive = FALSE)
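On the parallelization question: `xmlEventParse` makes a single sequential SAX pass, so the parse itself cannot be parallelized directly. One possibility is to split the work across files and merge the results afterwards. The sketch below is only illustrative and assumes the big file has first been split into standalone, well-formed XML chunks (the `chunks/` directory, `parse_chunk` helper, and core count are mine, not part of the original code):

library(XML)
library(parallel)

# Assumption: the big file has been pre-split into standalone,
# well-formed XML files (each holding a subset of <person> records).
chunk_files <- list.files("chunks", pattern = "\\.xml$", full.names = TRUE)

parse_chunk <- function(f) {
  handlers <- branchFunction()   # fresh closure/store per worker
  xmlEventParse(file = f, handlers = NULL, branches = handlers)
  handlers$getStore()
}

# Fork one worker per chunk (mclapply is Unix-only; use parLapply on
# Windows), then merge the per-chunk result lists into one.
results <- mclapply(chunk_files, parse_chunk, mc.cores = 4)
l <- unlist(results, recursive = FALSE)

Whether this pays off depends on how cheaply the file can be split; if the split itself requires a full sequential pass, the gain may be modest.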
  • What does code profiling say? [Here](http://adv-r.had.co.nz/Profiling.html) are some examples of how to profile your code and find bottlenecks. It goes without saying that without the xml file, it is very hard for us to see and act on bottlenecks. – Roman Luštrik Mar 22 '16 at 07:13
  • Thanks for sending me the resource, Roman. Although I didn't know about this resource beforehand, I used proc.time() to time the different portions of my code, and the above piece takes the most time. Isn't this a good way to test for bottlenecks in code? I also want to point out that I had been working for some time to read in this huge xml without any success until I found out about the SAX parser. The above is the only code that worked on the entire dataset. None of my earlier versions (using DOM style) worked on the full dataset. – dataanalyst Mar 22 '16 at 15:53
  • Here are the [25 percent](https://www.dropbox.com/s/9wrnz7mku6xlzdw/100.plans.xml.gz?dl=0) and the [100 percent](https://www.dropbox.com/s/d8dtw10abbiuca6/300.plans.xml.gz?dl=0) datasets. The same code takes about 0.4 hours for the 25% sample but takes about 9.4 hours for the 100% population. – dataanalyst Mar 22 '16 at 15:57
  • Try `lineprof`; it also graphs bottlenecks (see the profiling sketch after this list). – Roman Luštrik Mar 22 '16 at 17:27
  • Ok. Let me try it. BTW, can you tell me if objects are grown in the function I posted above? I read that closures try to avoid growing objects but am not totally sure about their inner workings. – dataanalyst Mar 22 '16 at 18:00
  • @RomanLuštrik Can you let me know if the xmlEventParse can be parallelized or not? – dataanalyst Mar 23 '16 at 03:27
  • I'm not familiar with that function and would have to dig into the inner workings, which I currently cannot do due to time constraints, sorry. Hopefully someone else will be able to chip in. – Roman Luštrik Mar 23 '16 at 06:51
  • Did you remove the data from your DB folder? – tchakravarty Mar 25 '16 at 07:02
  • Yeah. I did remove it. I can upload if you want to give it a try. I just moved on to SAXON XSLT as R appears to be unable to handle large xml files. – dataanalyst Mar 29 '16 at 04:03
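Following the profiling suggestions in the comments above, here is a minimal sketch using base R's `Rprof` (`lineprof` or `profvis` would add line-level graphs on top of the same data). It assumes the `plansfull.xml` file and `branchFunction` from the question:

# Profile the full parse, then rank functions by self time to see
# where the 9.5 hours actually go.
Rprof("parse.prof")
myfunctions <- branchFunction()
xmlEventParse(file = "plansfull.xml", handlers = NULL,
              branches = myfunctions)
Rprof(NULL)
head(summaryRprof("parse.prof")$by.self)

This would confirm whether the time is spent in `getNodeSet`, in building the branch subtrees, or in the store assignments, which in turn determines whether parallelization or a different XPath is the better fix.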

0 Answers