
I want to read data from a large XML file (20 GB) and manipulate it. I tried to use xmlParse(), but it ran into memory problems before the file even finished loading. Is there a more efficient way to do this?

My data dump looks like this:

<tags>
    <row Id="106929" TagName="moto-360" Count="1"/>
    <row Id="106930" TagName="n1ql" Count="1"/>
    <row Id="106931" TagName="fable" Count="1" ExcerptPostId="25824355" WikiPostId="25824354"/>
    <row Id="106932" TagName="deeplearning4j" Count="1"/>
    <row Id="106933" TagName="pystache" Count="1"/>
    <row Id="106934" TagName="jitter" Count="1"/>
    <row Id="106935" TagName="klein-mvc" Count="1"/>
</tags>
  • do you need the whole document tree in the workspace at once? Otherwise you could read it line by line – Verena Haunschmid Feb 23 '15 at 07:24
  • No need to load the whole data at once. I could read line by line and process it, or load the data in chunks and then process them. I would appreciate any suggestions. – Karthick Feb 23 '15 at 08:06
  • You could use the function readLines and set n to the number of lines you want to read. It should also be possible to use a SAX parser (the package you use provides one). I can add an example later (I don't have R on this machine). Maybe you can explain more about what you want to do with your file; then it will be easier to provide a meaningful example. – Verena Haunschmid Feb 23 '15 at 08:15
  • Perhaps http://stackoverflow.com/questions/22643580/combine-values-in-huge-xml-files also helps. – hrbrmstr Feb 23 '15 at 12:38
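A minimal sketch of the chunked readLines approach suggested in the comments above, assuming the dump sits in a file named Tags.xml (a hypothetical name) with one <row> per line:

con <- file("Tags.xml", open = "r")
while (length(chunk <- readLines(con, n = 10000)) > 0) {
  # each <row .../> is on its own line, so the chunk can be parsed
  # here (e.g. with regmatches() or gsub()) and then discarded
}
close(con)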

1 Answer


In the XML package, the xmlEventParse function implements SAX: it reads the XML as a stream and calls your handler functions as elements are encountered, so the whole document is never held in memory. If your XML is simple enough (repeating elements inside one root element, as in your dump), you can use the branches parameter to define a function for each element of interest.

Example:

library(XML)

MedlineCitation <- function(x, ...) {
  # This is a "branch" function.
  # x is an XML node: the complete subtree of one <MedlineCitation> element.
  # Find the <ArticleTitle> element inside it and print its text:
  ns <- getNodeSet(x, path = "//ArticleTitle")
  value <- xmlValue(ns[[1]])
  print(value)
}

Call the event parser:

xmlEventParse(
  file = "http://www.nlm.nih.gov/databases/dtd/medsamp2015.xml", 
  handlers = NULL, 
  branches = list(MedlineCitation = MedlineCitation)
)
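With handlers = NULL, no low-level SAX handlers are registered and only the branch functions run. Each branch function receives the fully built subtree of one matching element while the rest of the document is streamed past, so memory use stays bounded by the size of a single record rather than the file.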

Solution with a closure:

As in Martin Morgan's answer to storing-specific-xml-node-values-with-rs-xmleventparse, keep the results in a closure:

branchFunction <- function() {
  store <- new.env()  # accumulates results across handler calls
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//ArticleTitle")
    value <- xmlValue(ns[[1]])
    print(value)
    # to keep results instead of just printing them,
    # assign into the environment: store[[some_key]] <- some_value
  }
  getStore <- function() { as.list(store) }
  list(MedlineCitation = func, getStore = getStore)
}

myfunctions <- branchFunction()

xmlEventParse(
  file = "medsamp2015.xml", 
  handlers = NULL, 
  branches = myfunctions
)

# to see what was stored
myfunctions$getStore()
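
The same pattern adapts to the tag dump in the question. Below is a minimal sketch (the file name Tags.xml is an assumption): because every <row> is an empty element, a plain per-element handler that reads the attributes is enough, and no branch function is needed.

library(XML)

rowCounter <- function() {
  store <- new.env()
  # called once per <row>; attrs is a named character vector of attributes
  row <- function(name, attrs, ...) {
    store[[attrs[["TagName"]]]] <- as.integer(attrs[["Count"]])
  }
  getStore <- function() { as.list(store) }
  list(row = row, getStore = getStore)
}

counter <- rowCounter()
xmlEventParse(file = "Tags.xml", handlers = counter["row"])

# e.g. the count recorded for one tag:
counter$getStore()[["n1ql"]]

Since only one attribute vector is alive at a time, memory use depends on the number of distinct tags rather than on the file size.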
  • This works fine for me! Is it possible to solve this problem using Hadoop in R? Say I want to count each TagName. – Karthick Feb 24 '15 at 06:15
  • I am doing something similar to what you mentioned. The problem is that it slows down over time; it was very quick at printing things out at the beginning. I tried rm(list = ls()), but even that doesn't help. – Karthick Feb 24 '15 at 08:44
  • Updated with a solution from M. Morgan (with a closure). – bergant Feb 24 '15 at 11:21
  • @Karthick did you ever figure out a way to do this without the memory slow-down? I'm experiencing the same thing. – km5041 Nov 08 '17 at 01:39
  • @Karthick did you ever figure out a way to do this without the memory slow-down? I'm experiencing the same thing. – user1631306 May 09 '19 at 12:46